Abstract

We consider a basic problem in unsupervised learning: learning an unknown Poisson Binomial Distribution over $\{0, 1, \ldots, n\}$. A Poisson Binomial Distribution (PBD) is a sum $X = X_1 + \cdots + X_n$ of $n$ independent Bernoulli random variables which may have arbitrary expectations. We work in a framework where the learner is given access to independent draws from the distribution and must (with high probability) output a hypothesis distribution which has total variation distance at most $\epsilon$ from the unknown target PBD.

As our main result we give a highly efficient algorithm which learns to $\epsilon$-accuracy using $\tilde{O}(1/\epsilon^3)$ samples independent of $n$. The running time of the algorithm is quasilinear in the size of its input data, i.e. $\tilde{O}(\log(n)/\epsilon^3)$ bit-operations (observe that each draw from the distribution is a $\log(n)$-bit string). This is nearly optimal since any algorithm must use $\Omega(1/\epsilon^2)$ samples. We also give positive and negative results for some extensions of this learning problem.
1 Introduction

We begin by considering a somewhat fanciful scenario: You are the manager of an independent weekly newspaper in a city of $n$ people. Each week the $i$-th inhabitant of the city independently picks up a copy of your paper with probability $p_i$. Of course you do not know the values $p_1, \ldots, p_n$; each week you only see the total number of papers that have been picked up. For many reasons (advertising, production, revenue analysis, etc.) you would like to have a detailed "snapshot" of the probability distribution (pdf) describing how many readers you have each week. Is there an efficient algorithm to construct a high-accuracy approximation of the pdf from a number of observations that is independent of the population $n$? We show that the answer is "yes."

A Poisson Binomial Distribution (henceforth PBD) over the domain $[n] = \{0, 1, \ldots, n\}$ is the familiar distribution of a sum $X = \sum_{i=1}^n X_i$, where $X_1, \ldots, X_n$ are independent Bernoulli (0/1) random variables with $E[X_i] = p_i$. The $p_i$'s do not need to be all the same, and thus PBDs generalize the Binomial distribution $B(n, p)$ and, indeed, comprise a much richer class of distributions. (See Section 1.2.)
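To make the definition concrete, the following minimal sketch computes the exact pdf of a PBD from its parameters by adding one Bernoulli at a time. The function name `pbd_pmf` is an illustrative choice, not something defined in the paper, and the learner of course never has access to the $p_i$'s; the sketch is only meant to show what object is being learned.

```python
def pbd_pmf(p):
    """Exact pmf of X = X_1 + ... + X_n with E[X_i] = p[i] (hypothetical helper)."""
    pmf = [1.0]  # the empty sum puts all its mass on 0
    for p_i in p:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p_i)  # X_i = 0: the sum stays at k
            nxt[k + 1] += mass * p_i      # X_i = 1: the sum moves to k + 1
        pmf = nxt
    return pmf


if __name__ == "__main__":
    # Two fair coins give the Binomial B(2, 1/2).
    print(pbd_pmf([0.5, 0.5]))
```

The loop takes $O(n^2)$ arithmetic operations, which is fine for illustration but is unrelated to the running times claimed in the theorems below.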

As PBDs are one of the most basic classes of discrete distributions they have been intensely studied in probability and statistics (see Section 1.2); we note here that tail bounds on PBDs form an important special case of Chernoff/Hoeffding bounds [Che52, Hoe63, DP09]. In application domains, PBDs have many uses in research areas such as survey sampling, case-control studies, and survival analysis; see e.g. [CL97] for a survey of the many uses of these distributions in applications. It is thus natural to study the problem of learning/estimating an unknown PBD given access to independent samples drawn from the distribution; this is the problem we consider, and essentially settle, in this paper.

We work in a natural PAC-style model of learning an unknown discrete probability distribution which is essentially the model of [KMR+94]. In this learning framework for our problem, the learner is provided with independent samples drawn from an unknown PBD $X$. Using these samples, the learner must with probability $1 - \delta$ output a hypothesis distribution $\hat{X}$ such that the total variation distance $d_{TV}(X, \hat{X})$ is at most $\epsilon$, where $\epsilon, \delta > 0$ are accuracy and confidence parameters that are provided to the learner.$^1$ A proper learning algorithm in this framework outputs a distribution that is itself a Poisson Binomial Distribution, i.e. a vector $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_n)$ which describes the hypothesis PBD $\hat{X} = \sum_{i=1}^n \hat{X}_i$ where $E[\hat{X}_i] = \hat{p}_i$.

1.1 Our results 

Our main result is a highly efficient algorithm for learning PBDs from constantly many samples, i.e. quite surprisingly, the sample complexity of learning PBDs over $[n]$ is independent of $n$. We prove the following:

Theorem 1 (Main Theorem) Let $X = \sum_{i=1}^n X_i$ be an unknown PBD.

1. [Learning PBDs from constantly many samples] There is an algorithm with the following properties: given $n$ and access to independent draws from $X$, the algorithm uses $\tilde{O}(1/\epsilon^3) \cdot \log(1/\delta)$ samples from $X$, performs $\tilde{O}\left(\frac{1}{\epsilon^3} \log n \log \frac{1}{\delta}\right)$ bit operations,$^2$ and with probability $1 - \delta$ outputs a (succinct description of a) distribution $\hat{X}$ over $[n]$ which is such that $d_{TV}(\hat{X}, X) \le \epsilon$.

2. [Properly learning PBDs from constantly many samples] There is an algorithm with the following properties: given $n$ and access to independent draws from $X$, the algorithm uses $\tilde{O}(1/\epsilon^3) \cdot \log(1/\delta)$ samples from $X$, performs $(1/\epsilon)^{O(\log^2(1/\epsilon))} \cdot \tilde{O}\left(\log n \log \frac{1}{\delta}\right)$ bit operations, and with probability $1 - \delta$ outputs a (succinct description of a) vector $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_n)$ defining a PBD $\hat{X}$ such that $d_{TV}(\hat{X}, X) \le \epsilon$.

We note that since each sample drawn from $X$ is a $\log(n)$-bit string, the number of bit-operations performed by our first algorithm is quasilinear in the length of its input. The sample complexity of both our algorithms is not far from optimal, since $\Omega(1/\epsilon^2)$ samples are required even to distinguish the (simpler) Binomial distributions $B(n, 1/2)$ and $B(n, 1/2 + \epsilon/\sqrt{n})$, which have variation distance $\Omega(\epsilon)$.

$^1$ [KMR+94] used the Kullback-Leibler divergence as their distance measure but we find it more natural to use variation distance.
$^2$ We write $\tilde{O}(\cdot)$ to hide factors which are polylogarithmic in the argument to $\tilde{O}(\cdot)$; thus for example $\tilde{O}(a \log b)$ denotes a quantity which is $O(a \log b \cdot \log^c(a \log b))$ for some absolute constant $c$.

Motivated by these strong learning results for PBDs, we also consider learning a more general class of distributions, namely distributions of the form $X = \sum_{i=1}^n a_i X_i$ which are weighted sums of independent Bernoulli random variables. We give an algorithm which uses $O(\log n)$ samples and runs in $\mathrm{poly}(n)$ time if there are only constantly many different weights in the sum:

Theorem 2 (Learning sums of weighted independent Bernoulli random variables) Let $X = \sum_{i=1}^n a_i X_i$ be a weighted sum of unknown independent Bernoullis such that there are at most $k$ different values among $a_1, \ldots, a_n$. Then there is an algorithm with the following properties: given $n$, $a_1, \ldots, a_n$ and access to independent draws from $X$, it uses $k \log(n) \cdot \tilde{O}(1/\epsilon^2) \cdot \log(1/\delta)$ samples from the target distribution $X$, runs in time $\mathrm{poly}(n^k \cdot \epsilon^{-k \log^2(1/\epsilon)}) \cdot \log(1/\delta)$, and with probability $1 - \delta$ outputs a hypothesis vector $\hat{p} \in [0, 1]^n$ defining independent Bernoulli random variables $\hat{X}_i$ with $E[\hat{X}_i] = \hat{p}_i$ such that $d_{TV}(\hat{X}, X) \le \epsilon$, where $\hat{X} = \sum_{i=1}^n a_i \hat{X}_i$.

Note that setting all $a_i$'s to 1 in Theorem 2 gives a weaker result than Theorem 1 in terms of running time and sample complexity. To complement Theorem 2, we also show that if there are many distinct weights in the sum, then even for weights with a very simple structure any learning algorithm must use many samples:

Theorem 3 (Sample complexity lower bound for learning sums of weighted independent Bernoullis) Let $X = \sum_{i=1}^n i \cdot X_i$ be a weighted sum of unknown independent Bernoullis (where the $i$-th weight is simply $i$). Let $L$ be any learning algorithm which, given $n$ and access to independent draws from $X$, outputs a hypothesis distribution $\hat{X}$ such that $d_{TV}(\hat{X}, X) \le 1/25$ with probability at least $e^{-o(n)}$. Then $L$ must use $\Omega(n)$ samples.

1.2 Related work 

Many results in probability theory study approximations to the Poisson Binomial distribution via simpler distributions. In a well-known result, Le Cam [Cam60] shows that for any PBD $X = \sum_{i=1}^n X_i$ with $E[X_i] = p_i$,

$$d_{TV}\left(X, \mathrm{Poi}(p_1 + \cdots + p_n)\right) \le 2 \sum_{i=1}^n p_i^2,$$

where $\mathrm{Poi}(\lambda)$ denotes the Poisson distribution with parameter $\lambda$. Subsequently many other proofs of this result and similar ones were given using a range of different techniques; [HC60, Che74, DP86, BHJ92] is a sampling of work along these lines, and Steele [Ste94] gives an extensive list of relevant references. Significant work has also been done on approximating PBDs by normal distributions (see e.g. [Ber41, Ess42, Mik93, Vol95]) and by Binomial distributions (see e.g. [Ehm91, Soo96, Roo00]). These results provide structural information about PBDs that can be well-approximated via simpler distributions, but fall short of our goal of obtaining approximations of a general, unknown PBD up to an arbitrary accuracy. Indeed, the approximations obtained in the probability literature (such as the Poisson, Normal and Binomial approximations) typically depend on the first few moments of the target PBD, while higher moments are crucial for arbitrary approximation [Roo00].

Taking a different perspective, it is easy to show (see Section 2 of [KG71]) that every PBD is a unimodal distribution over $[n]$. The learnability of general unimodal distributions over $[n]$ is well understood: Birgé [Bir87a, Bir97] has given a computationally efficient algorithm that can learn any unimodal distribution over $[n]$ to variation distance $\epsilon$ from $O(\log(n)/\epsilon^3)$ samples, and has shown that any algorithm must use $\Omega(\log(n)/\epsilon^3)$ samples. (The [Bir87a] lower bound is stated for continuous unimodal distributions, but the arguments are easily adapted to the discrete case.) Our main result, Theorem 1, shows that the additional PBD assumption can be leveraged to obtain sample complexity independent of $n$ with a computationally highly efficient algorithm.

So, how might one leverage the structure of PBDs to remove $n$ from the sample complexity? The first property one might try to exploit is that a PBD assigns $1 - \epsilon$ of its mass to $O_\epsilon(\sqrt{n})$ points. So one could draw samples from the distribution to (approximately) identify these points and then try to estimate the probability assigned to each such point to within high enough accuracy so that the overall estimation error is $\epsilon$. Clearly, such an approach, if followed naively, would give $\mathrm{poly}(n)$ sample complexity. Alternatively, one could run Birgé's algorithm on the restricted support of size $O_\epsilon(\sqrt{n})$, but that will not improve the asymptotic sample complexity. A different approach would be to construct a small $\epsilon$-cover (under the total variation distance) of the space of all PBDs on $n$ variables. Indeed, if such a cover has size $N$, it can be shown (see Lemma 11 of the full paper, or Chapter 7 of [DL01]) that a target PBD can be learned from $O(\log(N)/\epsilon^2)$ samples. Still it is easy to argue that any cover needs to have size $\Omega(n)$, so this approach too gives a $\log(n)$ dependence in the sample complexity.

Our approach, which removes n completely from the sample complexity, requires a refined understanding 
of the structure of the set of all PBDs on n variables, in fact one that is more refined than the understanding 
provided by the aforementioned results (approximating a PBD by a Poisson, Normal, or Binomial distribution). 
We give an outline of the approach in the next section. 

1.3 Our approach 

The starting point of our algorithm for learning PBDs is a theorem of [DP11, Das08] that gives detailed information about the structure of a small $\epsilon$-cover (under the total variation distance) of the space of all PBDs on $n$ variables (see Theorem 4). Roughly speaking, this result says that every PBD is either close to a PBD whose support is sparse, or is close to a translated "heavy" Binomial distribution. Our learning algorithm exploits the structure of the cover to close in on the information that is absolutely necessary to approximate an unknown PBD. In particular, the algorithm has two subroutines corresponding to the (aforementioned) different types of distributions that the cover maintains. First, assuming that the target PBD is close to a sparsely supported distribution, it runs Birgé's unimodal distribution learner over a carefully selected subinterval of $[n]$ to construct a hypothesis $H_S$; the (purported) sparsity of the distribution makes it possible for this algorithm to use $\tilde{O}(1/\epsilon^3)$ samples independent of $n$. Then, assuming that the target PBD is close to a translated "heavy" Binomial distribution, the algorithm constructs a hypothesis Translated Poisson Distribution $H_P$ [R07] whose mean and variance match the estimated mean and variance of the target PBD; we show that $H_P$ is close to the target PBD if the latter is not close to any sparse distribution in the cover. At this point the algorithm has two hypothesis distributions, $H_S$ and $H_P$, one of which should be good; it remains to select one as the final output hypothesis. This is achieved using a form of "hypothesis testing" for probability distributions.

The above sketch captures the main ingredients of Part (1) of Theorem 1, but additional work needs to be done to get the proper learning algorithm of Part (2), since neither the sparse hypothesis $H_S$ output by Birgé's algorithm nor the Translated Poisson hypothesis $H_P$ is a PBD. Via a sequence of transformations we are able to show that the Translated Poisson hypothesis $H_P$ can be converted to a Binomial distribution $\mathrm{Bin}(n', p)$ for some $n' \le n$. For the sparse hypothesis, we obtain a PBD by searching a (carefully selected) subset of the $\epsilon$-cover to find a PBD that is close to our hypothesis $H_S$ (this search accounts for the increased running time in Part (2) versus Part (1)). We stress that for both the non-proper and proper learning algorithms sketched above, many technical subtleties and challenges arise in implementing the high-level plan, requiring a careful and detailed analysis which we give in full below. After all, eliminating $n$ from the sample complexity is surprising and warrants some non-trivial technical effort.

To prove Theorem 2 we take a more general approach and then specialize it to weighted sums of independent Bernoullis with constantly many distinct weights. We show that for any class $\mathcal{S}$ of target distributions, if $\mathcal{S}$ has an $\epsilon$-cover of size $N$ then there is a generic algorithm for learning an unknown distribution from $\mathcal{S}$ to accuracy $\epsilon$ that uses $O((\log N)/\epsilon^2)$ samples. Our approach is rather similar to the algorithm of [DL01] for choosing a density estimate (but different in some details); it works by carrying out a tournament that matches every pair of distributions in the cover against each other. Our analysis shows that with high probability some $\epsilon$-accurate distribution in the cover will survive the tournament undefeated, and that any undefeated distribution will with high probability be $O(\epsilon)$-accurate. We then specialize this general result to show how the tournament can be implemented efficiently for the class $\mathcal{S}$ of weighted sums of independent Bernoullis with constantly many distinct weights. Finally, the lower bound of Theorem 3 is proved by a direct information-theoretic argument.



1.4 Preliminaries 

For a distribution $X$ supported on $[n] = \{0, 1, \ldots, n\}$ we write $X(i)$ to denote the value $\Pr[X = i]$ of the pdf, and $X(\le i)$ to denote the value $\Pr[X \le i]$ of the cdf. For $S \subseteq [n]$ we write $X(S)$ to denote $\sum_{i \in S} X(i)$ and $X_S$ to denote the conditional distribution of $X$ restricted to $S$.

Recall that the total variation distance between two distributions $X$ and $Y$ over a finite domain $D$ is

$$d_{TV}(X, Y) := (1/2) \cdot \sum_{\alpha \in D} |X(\alpha) - Y(\alpha)| = \max_{S \subseteq D} [X(S) - Y(S)].$$
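The two expressions for the total variation distance can be checked numerically. The sketch below assumes both distributions are given as explicit probability lists over the same finite domain; the function names are illustrative only. The maximizing set in the second form is $S = \{\alpha : X(\alpha) > Y(\alpha)\}$.

```python
def tv_distance(x, y):
    """Half the L1 distance between two pmfs given as equal-length lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(x, y))


def tv_distance_max_form(x, y):
    """max over S of X(S) - Y(S), attained by S = {a : X(a) > Y(a)}."""
    return sum(a - b for a, b in zip(x, y) if a > b)


if __name__ == "__main__":
    x = [0.5, 0.5, 0.0]
    y = [0.25, 0.25, 0.5]
    print(tv_distance(x, y), tv_distance_max_form(x, y))
```

Both functions agree because the positive and negative parts of $X - Y$ each sum to half of the $L_1$ distance when $X$ and $Y$ are probability distributions.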

Fix a finite domain $D$, and let $\mathcal{P}$ denote some set of distributions over $D$. Given $\delta > 0$, a subset $\mathcal{Q} \subseteq \mathcal{P}$ is said to be a $\delta$-cover of $\mathcal{P}$ (w.r.t. total variation distance) if for every distribution $P$ in $\mathcal{P}$ there exists some distribution $Q$ in $\mathcal{Q}$ such that $d_{TV}(P, Q) \le \delta$.

We write $S = S_n$ to denote the set of all PBDs $X = \sum_{i=1}^n X_i$. We sometimes write $\{X_i\}$ to denote the PBD $X = \sum_{i=1}^n X_i$.

We also define the Translated Poisson distribution as follows. 



Definition 1 ([R07]) We say that an integer random variable $Y$ has a translated Poisson distribution with parameters $\mu$ and $\sigma^2$, written $Y = TP(\mu, \sigma^2)$, if $Y = \lfloor \mu - \sigma^2 \rfloor + \mathrm{Poisson}(\sigma^2 + \{\mu - \sigma^2\})$, where $\{\mu - \sigma^2\}$ represents the fractional part of $\mu - \sigma^2$.
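As an illustration of Definition 1, the following sketch evaluates the pmf of $TP(\mu, \sigma^2)$ at an integer point. The function name is a hypothetical choice and $\sigma^2 > 0$ is assumed. Directly from the definition, the mean of $TP(\mu, \sigma^2)$ is exactly $\mu$, while its variance is $\sigma^2 + \{\mu - \sigma^2\}$, which lies in $[\sigma^2, \sigma^2 + 1)$.

```python
import math


def translated_poisson_pmf(mu, sigma2, k):
    """Pr[Y = k] for Y = floor(mu - sigma2) + Poisson(sigma2 + frac(mu - sigma2))."""
    shift = math.floor(mu - sigma2)
    lam = sigma2 + (mu - sigma2 - shift)  # sigma2 plus the fractional part
    j = k - shift
    if j < 0:
        return 0.0
    # Poisson(lam) pmf at j, evaluated in log space for numerical stability.
    return math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))


if __name__ == "__main__":
    mu, sigma2 = 10.3, 4.0
    mean = sum(k * translated_poisson_pmf(mu, sigma2, k) for k in range(0, 200))
    print(mean)  # close to mu
```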

Translated Poisson distributions are useful to us because known results bound how far they are from PBDs 
and from each other. We will use the following results: 



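Lemma 1 (see (3.4) of [R07]) Let $J_1, \ldots, J_n$ be a sequence of independent random indicators with $E[J_i] = p_i$. Then

$$d_{TV}\left(\sum_{i=1}^n J_i, \; TP(\mu, \sigma^2)\right) \le \frac{\sqrt{\sum_{i=1}^n p_i^3 (1 - p_i)} + 2}{\sum_{i=1}^n p_i (1 - p_i)},$$

where $\mu = \sum_{i=1}^n p_i$ and $\sigma^2 = \sum_{i=1}^n p_i (1 - p_i)$.

Lemma 2 (Lemma 2.1 of [BL06]) Let $\mu_1, \mu_2 \in \mathbb{R}$ and $\sigma_1^2, \sigma_2^2 \in \mathbb{R}_+ \setminus \{0\}$ be such that $\lfloor \mu_1 - \sigma_1^2 \rfloor \le \lfloor \mu_2 - \sigma_2^2 \rfloor$. Then

$$d_{TV}\left(TP(\mu_1, \sigma_1^2), TP(\mu_2, \sigma_2^2)\right) \le \frac{|\mu_1 - \mu_2|}{\sigma_1} + \frac{|\sigma_1^2 - \sigma_2^2| + 1}{\sigma_1^2}.$$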

2 Learning an unknown sum of Bernoullis from poly(1/ε) samples

In this section we prove our main result, Theorem 1, by giving a sample- and time-efficient algorithm for learning an unknown PBD $X = \sum_{i=1}^n X_i$.

A cover for PBDs. An important ingredient in our analysis is the following theorem, which is an extension of Theorem 9 of the full version of [DP11]. It defines a cover (in total variation distance) of the space $S = S_n$ of all order-$n$ PBDs:

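Theorem 4 (Cover for PBDs) For all $\epsilon > 0$, there exists an $\epsilon$-cover $S_\epsilon \subseteq S$ of $S$ such that

1. $|S_\epsilon| \le n^3 \cdot O(1/\epsilon) + n \cdot (1/\epsilon)^{O(\log^2 1/\epsilon)}$; and

2. The set $S_\epsilon$ can be constructed in time linear in its representation size, i.e. $\tilde{O}(n^3/\epsilon) + \tilde{O}(n) \cdot (1/\epsilon)^{O(\log^2 1/\epsilon)}$.

Moreover, if $\{Y_i\} \in S_\epsilon$, then the collection $\{Y_i\}$ has one of the following forms, where $k = k(\epsilon) \le C/\epsilon$ is a positive integer, for some absolute constant $C > 0$:

(i) (Sparse Form) There is a value $\ell \le k^3 = O(1/\epsilon^3)$ such that for all $i \le \ell$ we have $E[Y_i] \in \left\{\frac{1}{k^2}, \frac{2}{k^2}, \ldots, \frac{k^2 - 1}{k^2}\right\}$, and for all $i > \ell$ we have $E[Y_i] \in \{0, 1\}$.

(ii) ($k$-heavy Binomial Form) There is a value $\ell \in \{0, 1, \ldots, n\}$ and a value $q \in \left\{\frac{1}{kn}, \frac{2}{kn}, \ldots, \frac{kn - 1}{kn}\right\}$ such that for all $i \le \ell$ we have $E[Y_i] = q$; for all $i > \ell$ we have $E[Y_i] \in \{0, 1\}$; and $\ell, q$ satisfy the bounds $\ell q \ge k^2 - \frac{1}{k}$ and $\ell q (1 - q) \ge k^2 - k - 1 - \frac{3}{k}$.

Finally, for every $\{X_i\} \in S$ for which there is no $\epsilon$-neighbor in $S_\epsilon$ that is in sparse form, there exists a collection $\{Y_i\} \in S_\epsilon$ in $k$-heavy Binomial form such that

(iii) $d_{TV}\left(\sum_i X_i, \sum_i Y_i\right) \le \epsilon$; and

(iv) if $\mu = E[\sum_i X_i]$, $\mu' = E[\sum_i Y_i]$, $\sigma^2 = \mathrm{Var}[\sum_i X_i]$ and $\sigma'^2 = \mathrm{Var}[\sum_i Y_i]$, then $|\mu - \mu'| = O(\epsilon)$ and $|\sigma^2 - \sigma'^2| = O(1 + \epsilon \cdot (1 + \sigma^2))$.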

We remark that [Das08] establishes the same theorem, except that the size of the cover is $n^3 \cdot O(1/\epsilon) + n \cdot (1/\epsilon)^{O(1/\epsilon^2)}$. Indeed, this weaker bound is obtained by including in the cover all possible collections $\{Y_i\} \in S$ in sparse form and all possible collections in $k$-heavy Binomial form, for $k = O(1/\epsilon)$ specified by the theorem. [DP11] obtains a smaller cover by only selecting a subset of the collections in sparse form included in the cover of [Das08]. Finally, the cover theorem stated in [Das08, DP11] does not include the part of the above statement following "finally." We provide a proof of this extension in Section 4.1.

We remark also that our analysis in this paper in fact establishes a slightly stronger version of the above theorem, with an improved bound on the cover size (as a function of $n$) and stronger conditions on the Binomial Form distributions in the cover. We present this strengthened version of the Cover Theorem in Section 4.2.

The learning algorithm. Our algorithm Learn-PBD has the general structure shown below (a detailed version 
is given later). 



Learn-PBD

1. Run Learn-Sparse$^X(n, \epsilon, \delta/3)$ to get hypothesis distribution $H_S$.

2. Run Learn-Poisson$^X(n, \epsilon, \delta/3)$ to get hypothesis distribution $H_P$.

3. Return the distribution which is the output of Choose-Hypothesis$^X(H_S, H_P, \epsilon, \delta/3)$.



Figure 1: Learn-PBD
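The control flow of Figure 1 can be sketched as follows. This is only a skeleton under the assumption that the three subroutines are supplied as callables which already have sample access to $X$; the parameter names are hypothetical and none of the subroutines is implemented here.

```python
def learn_pbd(learn_sparse, learn_poisson, choose_hypothesis, n, eps, delta):
    """Skeleton of Learn-PBD; each subroutine gets a delta/3 share of the failure budget."""
    h_s = learn_sparse(n, eps, delta / 3)    # good if X is close to a sparse-form PBD
    h_p = learn_poisson(n, eps, delta / 3)   # good if X is close to a heavy Binomial-form PBD
    return choose_hypothesis(h_s, h_p, eps, delta / 3)
```

Splitting $\delta$ three ways lets a union bound cap the overall failure probability at $\delta$.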

The subroutine Learn-Sparse$^X$ is given sample access to $X$ and is designed to find an $\epsilon$-accurate hypothesis if the target PBD $X$ is $\epsilon$-close to some sparse form PBD inside the cover $S_\epsilon$; similarly, Learn-Poisson$^X$ is designed to find an $\epsilon$-accurate hypothesis if $X$ is not $\epsilon$-close to a sparse form PBD (in this case, Theorem 4 implies that $X$ must be $\epsilon$-close to some $k(\epsilon)$-heavy Binomial form PBD). Finally, Choose-Hypothesis$^X$ is designed to choose one of the two hypotheses $H_S, H_P$ as being $\epsilon$-close to $X$. The following subsections describe and prove correctness of these subroutines. We remark that the subroutines Learn-Sparse and Learn-Poisson do not return the distributions $H_S$ and $H_P$ as a list of probabilities for every point in $[n]$; rather, they return a succinct description of these distributions in order to keep the running time of the algorithm logarithmic in $n$.

2.1 Learning when X is close to a Sparse Form PBD 

Our starting point here is the simple observation that any PBD is a unimodal distribution over the domain $\{0, 1, \ldots, n\}$ (there is a simple inductive proof of this, or see Section 2 of [KG71]). This will enable us to use the algorithm of Birgé [Bir97] for learning unimodal distributions. We recall Birgé's result, and refer the reader to Section 5 for an explanation of how Theorem 5 as stated below follows from [Bir97].



Theorem 5 ([Bir97]) For all $n, \epsilon, \delta > 0$, there is an algorithm that draws $\frac{\log n}{\epsilon^3} \cdot \tilde{O}(\log \frac{1}{\delta})$ samples from an unknown unimodal distribution $X$ over $[n]$, does $\tilde{O}\left(\frac{\log^2 n}{\epsilon^3} \log^2 \frac{1}{\delta}\right)$ bit-operations, and outputs a (succinct description of a) hypothesis distribution $H$ over $[n]$ that has the following form: $H$ is uniform over subintervals $[a_1, b_1], [a_2, b_2], \ldots, [a_k, b_k]$, whose union $\cup_{i=1}^k [a_i, b_i] = [n]$, where $k = O\left(\frac{\log n}{\epsilon}\right)$. In particular, the algorithm outputs the lists $a_1$ through $a_k$ and $b_1$ through $b_k$, as well as the total probability mass that $H$ assigns to each subinterval $[a_i, b_i]$, $i = 1, \ldots, k$. Finally, with probability at least $1 - \delta$, $d_{TV}(X, H) \le \epsilon$.

In the rest of this subsection we prove the following: 

Lemma 3 For all $n, \epsilon', \delta' > 0$, there is an algorithm Learn-Sparse$^X(n, \epsilon', \delta')$ that draws $O\left(\frac{1}{\epsilon'^3} \log \frac{1}{\epsilon'} \log \frac{1}{\delta'}\right)$ samples from a target PBD $X$ over $[n]$, does $\log n \cdot \tilde{O}\left(\frac{1}{\epsilon'^3} \log^2 \frac{1}{\delta'}\right)$ bit operations, and outputs a (succinct description of a) hypothesis distribution $H_S$ over $[n]$ that has the following form: its support is contained in an explicitly specified interval $[a, b] \subset [n]$, where $|b - a| = O(1/\epsilon'^3)$, and for every point in $[a, b]$ the algorithm explicitly specifies the probability assigned to that point by $H_S$.$^3$ Moreover, the algorithm has the following guarantee: Suppose $X$ is $\epsilon'$-close to some sparse form PBD $Y$ in the cover $S_{\epsilon'}$ of Theorem 4. Then, with probability at least $1 - \delta'$, $d_{TV}(X, H_S) \le c_1 \epsilon'$, for some absolute constant $c_1 \ge 1$, and the support of $H_S$ is a subset of the support of $Y$.

Proof: The algorithm Learn-Sparse$^X(n, \epsilon', \delta')$ works as follows: It first draws $M = 32 \log(8/\delta')/\epsilon'^2$ samples from $X$ and sorts them to obtain a list of values $0 \le s_1 \le \cdots \le s_M \le n$. In terms of these samples, let us define $a := s_{\lceil 2\epsilon' M \rceil}$ and $b := s_{\lfloor (1 - 2\epsilon') M \rfloor}$. We claim the following:

Claim 4 With probability at least $1 - \delta'/2$, we have $X(\le a) \in [3\epsilon'/2, 5\epsilon'/2]$ and $X(\le b) \in [1 - 5\epsilon'/2, 1 - 3\epsilon'/2]$.

Proof of Claim 4: We only show that $X(\le a) \ge 3\epsilon'/2$ with probability at least $1 - \delta'/8$, since the arguments for $X(\le a) \le 5\epsilon'/2$, $X(\le b) \le 1 - 3\epsilon'/2$ and $X(\le b) \ge 1 - 5\epsilon'/2$ are identical. Given that each of these conditions is met with probability at least $1 - \delta'/8$, the union bound establishes our claim.

To show that $X(\le a) \ge 3\epsilon'/2$ is satisfied with probability at least $1 - \delta'/8$ we argue as follows: Let $a' = \max\{i \mid X(\le i) < 3\epsilon'/2\}$. Clearly, $X(\le a') < 3\epsilon'/2$ while $X(\le a' + 1) \ge 3\epsilon'/2$. Given this, of $M$ samples drawn from $X$ an expected number of at most $3\epsilon' M/2$ samples are $\le a'$. It follows then from the Chernoff bound that the probability that more than $2\epsilon' M$ samples are $\le a'$ is at most $e^{-(\epsilon'/2)^2 M/2} \le \delta'/8$. Hence, $a \ge a' + 1$, which implies that $X(\le a) \ge 3\epsilon'/2$. ■
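The choice of the endpoints $a$ and $b$ from the sorted samples can be sketched as below. The function names are illustrative; the order statistics $s_1 \le \cdots \le s_M$ are 1-indexed in the text, hence the `- 1` when indexing the Python list, and the guard against index 0 is an addition of this sketch.

```python
import math


def sample_size(eps, delta):
    """M = 32 log(8/delta) / eps^2, rounded up to an integer."""
    return math.ceil(32 * math.log(8 / delta) / eps ** 2)


def sparse_interval_endpoints(samples, eps):
    """Return (a, b) = (s_{ceil(2 eps M)}, s_{floor((1 - 2 eps) M)})."""
    s = sorted(samples)
    m = len(s)
    lo = max(1, math.ceil(2 * eps * m))          # rank of a (guarded against 0)
    hi = max(1, math.floor((1 - 2 * eps) * m))   # rank of b
    return s[lo - 1], s[hi - 1]


if __name__ == "__main__":
    print(sparse_interval_endpoints(range(64), 0.125))
```

Roughly a $2\epsilon'$ fraction of the samples is cut off on each side, which is what Claim 4 turns into bounds on $X(\le a)$ and $X(\le b)$.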

If $b - a > (C/\epsilon')^3$, where $C$ is the constant in the statement of Theorem 4, the algorithm outputs "fail", returning the trivial hypothesis which puts probability mass 1 on the point 0. Otherwise, the algorithm runs Birgé's unimodal distribution learner (Theorem 5) on the conditional distribution $X_{[a,b]}$, and outputs the result of Birgé's algorithm. Since $X$ is unimodal, it follows that $X_{[a,b]}$ is also unimodal, hence Birgé's algorithm is appropriate for learning it. The way we apply Birgé's algorithm to learn $X_{[a,b]}$ given samples from the original distribution $X$ is the obvious one: we draw samples from $X$, ignoring all samples that fall outside of $[a, b]$, until the right $O(\log(1/\delta') \log(1/\epsilon')/\epsilon'^3)$ number of samples fall inside $[a, b]$, as required by Birgé's algorithm for learning a distribution of support of size $(C/\epsilon')^3$ with probability $1 - \delta'/4$. Once we have the right number of samples in $[a, b]$, we run Birgé's algorithm to learn the conditional distribution $X_{[a,b]}$. Note that the number of samples we need to draw from $X$ until the right $O(\log(1/\delta') \log(1/\epsilon')/\epsilon'^3)$ number of samples fall inside $[a, b]$ is still $O(\log(1/\delta') \log(1/\epsilon')/\epsilon'^3)$, with probability at least $1 - \delta'/4$. Indeed, since $X([a, b]) = 1 - O(\epsilon')$, it follows from the Chernoff bound that with probability at least $1 - \delta'/4$, if $K = \Theta(\log(1/\delta') \log(1/\epsilon')/\epsilon'^3)$ samples are drawn from $X$, at least $K(1 - O(\epsilon'))$ fall inside $[a, b]$.
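The "obvious" way of simulating draws from the conditional distribution $X_{[a,b]}$ is rejection sampling, sketched below. The names `draw`, `needed` and `max_draws` are assumptions of this sketch: `draw` stands for one sample access to $X$, and `max_draws` is a safety cap that the paper does not discuss.

```python
import random


def draw_conditional_samples(draw, a, b, needed, max_draws):
    """Call draw() until `needed` samples land in [a, b]; samples outside are discarded."""
    kept, used = [], 0
    while len(kept) < needed and used < max_draws:
        z = draw()
        used += 1
        if a <= z <= b:
            kept.append(z)
    return kept, used


if __name__ == "__main__":
    rng = random.Random(0)
    kept, used = draw_conditional_samples(lambda: rng.randint(0, 9), 2, 7, 50, 10_000)
    print(len(kept), used)
```

Since $X([a, b]) = 1 - O(\epsilon')$, only an $O(\epsilon')$ fraction of draws is expected to be discarded, which is why the total number of draws stays within a constant factor of `needed`.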



$^3$ In particular, our algorithm will output a list of pointers, mapping every point in $[a, b]$ to some memory location where the probability assigned to that point by $H_S$ is written.



Analysis: It is easy to see that the sample complexity of our algorithm is as promised. For the running time, notice that, if Birgé's algorithm is invoked, it will return two lists of numbers $a_1$ through $a_k$ and $b_1$ through $b_k$, as well as a list of probability masses $q_1, \ldots, q_k$ assigned to each subinterval $[a_i, b_i]$, $i = 1, \ldots, k$, by the hypothesis distribution $H_S$, where $k = O(\log(1/\epsilon')/\epsilon')$. In linear time, we can compute a list of probabilities $\hat{q}_1, \ldots, \hat{q}_k$, representing the probability assigned by $H_S$ to every point of subinterval $[a_i, b_i]$, for $i = 1, \ldots, k$. So we can represent our output hypothesis $H_S$ via a data structure that maintains $O(1/\epsilon'^3)$ pointers, having one pointer per point inside $[a, b]$. The pointers map points to probabilities assigned by $H_S$ to these points. Thus turning the output of Birgé's algorithm into an explicit distribution over $[a, b]$ incurs linear overhead in our running time, and hence the running time of our algorithm is also as promised. Moreover, we also note that the output distribution has the promised structure, since in one case it has a single atom at 0 and in the other case it is the output of Birgé's algorithm on a distribution of support of size $(C/\epsilon')^3$.

It only remains to justify the last part of the lemma. Let $Y$ be the sparse-form PBD that $X$ is close to; say that $Y$ is supported on $\{a', \ldots, b'\}$ where $b' - a' \le (C/\epsilon')^3$. Since $X$ is $\epsilon'$-close to $Y$ in total variation distance it must be the case that $X(\le a' - 1) \le \epsilon'$. Since $X(\le a) \ge 3\epsilon'/2$ by Claim 4, it must be the case that $a \ge a'$. Similar arguments give that $b \le b'$. So the interval $[a, b]$ is contained in $[a', b']$ and has length at most $(C/\epsilon')^3$. This means that Birgé's algorithm is indeed used correctly by our algorithm to learn $X_{[a,b]}$, with probability at least $1 - \delta'/2$ (that is, unless Claim 4 fails). Now it follows from the correctness of Birgé's algorithm (Theorem 5) and the discussion above, that the hypothesis $H_S$ output when Birgé's algorithm is invoked satisfies $d_{TV}(H_S, X_{[a,b]}) \le \epsilon'$, with probability at least $1 - \delta'/2$, i.e. unless either Birgé's algorithm fails, or we fail to get the right number of samples landing inside $[a, b]$. To conclude the proof of the lemma we note that:

$$d_{TV}(X, X_{[a,b]}) = \frac{1}{2} \sum_{i \in [a,b]} |X_{[a,b]}(i) - X(i)| + \frac{1}{2} \sum_{i \notin [a,b]} X(i)$$
$$= \frac{1}{2} \sum_{i \in [a,b]} \left| \frac{1}{X([a,b])} X(i) - X(i) \right| + \frac{1}{2} \sum_{i \notin [a,b]} X(i)$$
$$= \frac{1}{2} \left( \frac{1}{X([a,b])} - 1 \right) \sum_{i \in [a,b]} X(i) + O(\epsilon') = O(\epsilon').$$

So the triangle inequality gives: $d_{TV}(H_S, X) = O(\epsilon')$, and Lemma 3 is proved. ■

2.2 Learning when X is close to a k-heavy Binomial Form PBD

Lemma 5 For all $n, \epsilon', \delta' > 0$, there is an algorithm Learn-Poisson$^X(n, \epsilon', \delta')$ that draws $O(\log(1/\delta')/\epsilon'^2)$ samples from a target PBD $X$ over $[n]$, runs in time $O(\log n \cdot \log(1/\delta')/\epsilon'^2)$, and returns two parameters $\hat{\mu}$ and $\hat{\sigma}^2$. Moreover, the algorithm has the following guarantee: Suppose $X$ is not $\epsilon'$-close to any Sparse Form PBD in the cover $S_{\epsilon'}$ of Theorem 4. Let $H_P$ be the translated Poisson distribution with parameters $\hat{\mu}$ and $\hat{\sigma}^2$, i.e. $H_P = TP(\hat{\mu}, \hat{\sigma}^2)$. Then with probability at least $1 - \delta'$ we have $d_{TV}(X, H_P) \le c_2 \epsilon'$, for some absolute constant $c_2 \ge 1$.

Our proof plan is to exploit the structure of the cover of Theorem 4. In particular, if $X$ is not $\epsilon'$-close to any Sparse Form PBD in the cover, it must be $\epsilon'$-close to a PBD in Heavy Binomial Form with approximately the same mean and variance as $X$, as specified by the final part of the cover theorem. Now, given that a PBD in Heavy Binomial Form is just a translated Binomial distribution, a natural strategy is to estimate the mean and variance of the target PBD $X$ and output as a hypothesis a translated Poisson distribution with these parameters. We show that this strategy is a successful one.
We start by showing that we can estimate the mean and variance of the target PBD $X$.

Lemma 6 For all $n, \epsilon, \delta > 0$, there exists an algorithm $A(n, \epsilon, \delta)$ with the following properties: given access to a PBD $X$ over $[n]$, it produces estimates $\hat{\mu}$ and $\hat{\sigma}^2$ for $\mu = E[X]$ and $\sigma^2 = \mathrm{Var}[X]$ respectively such that with probability at least $1 - \delta$:

$$|\mu - \hat{\mu}| \le \epsilon \cdot \sigma \quad \text{and} \quad |\sigma^2 - \hat{\sigma}^2| \le \epsilon \cdot \sigma^2 \sqrt{4 + \frac{1}{\sigma^2}}.$$

The algorithm uses $O(\log(1/\delta)/\epsilon^2)$ samples and runs in time $O(\log n \log(1/\delta)/\epsilon^2)$.

Proof of Lemma 6: We treat the estimation of $\mu$ and $\sigma^2$ separately. For both estimation problems we show how to use $O(1/\epsilon^2)$ samples to obtain estimates $\hat{\mu}$ and $\hat{\sigma}^2$ achieving the required guarantees with probability at least 2/3. Then a routine procedure allows us to boost the success probability to $1 - \delta$ at the expense of a multiplicative factor $O(\log 1/\delta)$ on the number of samples. While we omit the details of the routine boosting argument, we remind the reader that it involves running the weak estimator $O(\log 1/\delta)$ times to obtain estimates $\hat{\mu}_1, \ldots, \hat{\mu}_{O(\log 1/\delta)}$ and outputting the median of these estimates, and similarly for estimating $\sigma^2$.
We proceed to specify and analyze the weak estimators for $\mu$ and $\sigma^2$ separately:
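The routine boosting step mentioned above (repeat a weak estimator and take the median) can be sketched as follows. The names are illustrative, and the number of repetitions is left as a parameter because the constant inside $O(\log 1/\delta)$ is not specified in the text.

```python
import statistics


def boost_by_median(weak_estimator, repetitions):
    """Run a weak estimator (correct with probability >= 2/3) several times; return the median."""
    return statistics.median(weak_estimator() for _ in range(repetitions))


if __name__ == "__main__":
    # A toy weak estimator that is badly wrong on two of its five runs.
    outcomes = iter([5.0, 5.1, 1000.0, 4.9, -1000.0])
    print(boost_by_median(lambda: next(outcomes), 5))
```

The median is within the desired error whenever more than half of the runs are, and by a Chernoff bound that fails with probability exponentially small in the number of repetitions.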

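• Weak estimator for $\mu$: Let $Z_1, \ldots, Z_m$ be independent samples from $X$, and let $\hat{\mu} = \frac{\sum_i Z_i}{m}$. Then

$$E[\hat{\mu}] = \mu \quad \text{and} \quad \mathrm{Var}[\hat{\mu}] = \frac{1}{m} \mathrm{Var}[X] = \frac{1}{m} \sigma^2.$$

So Chebyshev's inequality implies that

$$\Pr\left[|\hat{\mu} - \mu| \ge t\sigma/\sqrt{m}\right] \le \frac{1}{t^2}.$$

Choosing $t = \sqrt{3}$ and $m = 3/\epsilon^2$, the above imply that $|\hat{\mu} - \mu| \le \epsilon\sigma$ with probability at least 2/3.

• Weak estimator for $\sigma^2$: Let $Z_1, \ldots, Z_m$ be independent samples from $X$, and let $\hat{\sigma}^2 = \frac{\sum_i \left(Z_i - \frac{1}{m} \sum_i Z_i\right)^2}{m - 1}$ be the unbiased sample variance (note the use of Bessel's correction). Then it can be checked [Joh03] that

$$E[\hat{\sigma}^2] = \sigma^2 \quad \text{and} \quad \mathrm{Var}[\hat{\sigma}^2] = \sigma^4 \left( \frac{2}{m - 1} + \frac{\kappa}{m} \right),$$

where $\kappa$ is the kurtosis of the distribution of $X$. To bound $\kappa$ in terms of $\sigma^2$ suppose that $X = \sum_{i=1}^n X_i$, where $E[X_i] = p_i$ for all $i$. Then

$$\kappa = \frac{1}{\sigma^4} \sum_i (1 - 6 p_i (1 - p_i))(1 - p_i) p_i \quad \text{(see [NJ05])}$$
$$\le \frac{1}{\sigma^4} \sum_i |1 - 6 p_i (1 - p_i)| (1 - p_i) p_i$$
$$\le \frac{1}{\sigma^4} \sum_i (1 - p_i) p_i = \frac{1}{\sigma^2}.$$

So $\mathrm{Var}[\hat{\sigma}^2] = \sigma^4 \left( \frac{2}{m - 1} + \frac{\kappa}{m} \right) \le \frac{\sigma^4}{m} \left( 4 + \frac{1}{\sigma^2} \right)$. So Chebyshev's inequality implies that

$$\Pr\left[ |\hat{\sigma}^2 - \sigma^2| \ge t \frac{\sigma^2}{\sqrt{m}} \sqrt{4 + \frac{1}{\sigma^2}} \right] \le \frac{1}{t^2}.$$

Choosing $t = \sqrt{3}$ and $m = 3/\epsilon^2$, the above imply that $|\hat{\sigma}^2 - \sigma^2| \le \epsilon \sigma^2 \sqrt{4 + \frac{1}{\sigma^2}}$ with probability at least 2/3.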
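The two weak estimators are the ordinary sample mean and the Bessel-corrected sample variance; a minimal sketch is given below. The function names are illustrative, and rounding $m = 3/\epsilon^2$ up to an integer is an assumption of the sketch.

```python
import math


def weak_sample_size(eps):
    """m = 3 / eps^2 samples, rounded up."""
    return math.ceil(3 / eps ** 2)


def weak_mean(samples):
    """Sample mean of Z_1, ..., Z_m."""
    return sum(samples) / len(samples)


def weak_variance(samples):
    """Unbiased sample variance (Bessel's correction: divide by m - 1)."""
    m = len(samples)
    mean = sum(samples) / m
    return sum((z - mean) ** 2 for z in samples) / (m - 1)


if __name__ == "__main__":
    zs = [1, 2, 3, 4]
    print(weak_mean(zs), weak_variance(zs))
```

Each estimator alone is only guaranteed to succeed with probability 2/3; the median trick described in the proof is what brings the failure probability down to $\delta$.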



Proof of Lemma 5: Suppose now that $X$ is not $\epsilon'$-close to any PBD in sparse form inside the cover $\mathcal{S}_{\epsilon'}$ of Theorem 4. Then there exists a PBD $Z$ in $k = k(\epsilon')$-heavy Binomial form inside $\mathcal{S}_{\epsilon'}$ that is within total variation distance $\epsilon'$ from $X$. We use the existence of such a $Z$ to obtain lower bounds on the mean and variance of $X$. Indeed, suppose that the distribution of $Z$ is $\mathrm{Bin}(\ell, q) + t$, i.e. a Binomial with parameters $\ell, q$ that is translated by $t$. Then Theorem 4 certifies that the following conditions are satisfied by the parameters $\ell, q, t$, $\mu = \mathbf{E}[X]$ and $\sigma^2 = \mathrm{Var}[X]$:

(a) $\ell q \ge k^2$;

(b) $\ell q(1-q) \ge k^2 - k - 1 - \frac{3}{k}$;

(c) $|t + \ell q - \mu| = O(\epsilon')$; and

(d) $|\ell q(1-q) - \sigma^2| = O(1 + \epsilon' \cdot (1 + \sigma^2))$.

In particular, conditions (b) and (d) above imply that

$$\sigma^2 = \Omega(k^2) = \Omega(1/\epsilon'^2) \ge \theta^2 \tag{1}$$

for some universal constant $\theta$. Hence we can apply Lemma 6 with $\epsilon = \epsilon'/\sqrt{4 + \frac{1}{\theta^2}}$ and $\delta = \delta'$ to obtain (from $O(\log(1/\delta')/\epsilon'^2)$ samples and with probability at least $1 - \delta'$) estimates $\hat{\mu}$ and $\hat{\sigma}^2$ of $\mu$ and $\sigma^2$ respectively that satisfy

$$|\mu - \hat{\mu}| \le \epsilon' \cdot \sigma \quad\text{and}\quad |\sigma^2 - \hat{\sigma}^2| \le \epsilon' \cdot \sigma^2. \tag{2}$$

Now let $Y$ be a random variable distributed according to the translated Poisson distribution $TP(\hat{\mu}, \hat{\sigma}^2)$. We conclude the proof of Lemma 5 by showing that $Y$ and $X$ are within $O(\epsilon')$ in total variation distance.

Claim 7 If $X$ and $Y$ are as above, then $d_{\mathrm{TV}}(X, Y) \le O(\epsilon')$.

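Proof of Claim 7: We make use of Lemma 1. Suppose that $X = \sum_{i=1}^n X_i$, where $\mathbf{E}[X_i] = p_i$ for all $i$. Lemma 1 implies that

$$d_{\mathrm{TV}}(X, TP(\mu, \sigma^2)) \le \frac{\sqrt{\sum_i p_i^3(1-p_i)} + 2}{\sum_i p_i(1-p_i)} \le \frac{\sqrt{\sum_i p_i(1-p_i)} + 2}{\sum_i p_i(1-p_i)}$$
$$\le \frac{1}{\sqrt{\sum_i p_i(1-p_i)}} + \frac{2}{\sum_i p_i(1-p_i)} = \frac{1}{\sigma} + \frac{2}{\sigma^2} = O(\epsilon'), \tag{3}$$

where the last step uses $\sigma^2 = \Omega(1/\epsilon'^2)$ from (1).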



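It remains to bound the total variation distance between the translated Poisson distributions $TP(\mu, \sigma^2)$ and $TP(\hat{\mu}, \hat{\sigma}^2)$. For this we use Lemma 2. Lemma 2 implies

$$d_{\mathrm{TV}}(TP(\mu, \sigma^2), TP(\hat{\mu}, \hat{\sigma}^2)) \le \frac{|\mu - \hat{\mu}|}{\min(\sigma, \hat{\sigma})} + \frac{|\sigma^2 - \hat{\sigma}^2| + 1}{\min(\sigma^2, \hat{\sigma}^2)}$$
$$\le \frac{\epsilon' \sigma}{\min(\sigma, \hat{\sigma})} + \frac{\epsilon' \cdot \sigma^2 + 1}{\min(\sigma^2, \hat{\sigma}^2)}$$
$$\le \frac{\epsilon' \sigma}{\sigma\sqrt{1 - \epsilon'}} + \frac{\epsilon' \cdot \sigma^2 + 1}{\sigma^2(1 - \epsilon')}$$
$$= O(\epsilon') + \frac{1}{(1 - \epsilon')\sigma^2} = O(\epsilon') + O(\epsilon'^2) = O(\epsilon'). \tag{4}$$

Here the second inequality is (2), the third uses $\hat{\sigma}^2 \ge (1 - \epsilon')\sigma^2$ (again by (2)), and the bound on the last term uses (1).

The claim follows from (3), (4) and the triangle inequality. This concludes the proof of Lemma 5 as well. ■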
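To make the use of Lemma 2 in (4) concrete, the following sketch evaluates the right-hand side of the bound for two parameter pairs. The function name `tp_distance_bound` is hypothetical, and the formula is the one applied in the derivation above (it is an upper bound on the distance, not the distance itself).

```python
import math


def tp_distance_bound(mu1, var1, mu2, var2):
    """Upper bound on d_TV(TP(mu1, var1), TP(mu2, var2)) in the form used in (4):

        |mu1 - mu2| / min(sigma1, sigma2) + (|var1 - var2| + 1) / min(var1, var2)
    """
    if min(var1, var2) <= 0:
        raise ValueError("both variances must be positive")
    var_min = min(var1, var2)
    sigma_min = math.sqrt(var_min)
    return abs(mu1 - mu2) / sigma_min + (abs(var1 - var2) + 1) / var_min
```

With exact estimates the bound reduces to $1/\sigma^2$, which is why the argument needs the lower bound $\sigma^2 = \Omega(1/\epsilon'^2)$ from (1).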

As a final remark, we note that the algorithm described above does not need to know a priori whether or not $X$ is $\epsilon'$-close to a PBD in sparse form inside the cover $\mathcal{S}_{\epsilon'}$ of Theorem 4. The algorithm simply runs the estimator of Lemma 6 with $\epsilon = \epsilon'/\sqrt{4 + \frac{1}{\theta^2}}$ and $\delta' = \delta$ and outputs whatever estimates $\hat{\mu}$ and $\hat{\sigma}^2$ the algorithm of Lemma 6 produces.

2.3 Hypothesis testing 

Our hypothesis testing routine Choose-Hypothesis$^X$ runs a simple "competition" to choose a winner between two candidate hypothesis distributions $H_1$ and $H_2$ over $[n]$ that it is given in the input either explicitly, or in some succinct way. We show that if at least one of the two candidate hypotheses is close to the target distribution $X$, then with high probability over the samples drawn from $X$ the routine selects as winner a candidate that is close to $X$. This basic approach of running a competition between candidate hypotheses is quite similar to the "Scheffé estimate" proposed by Devroye and Lugosi (see [DL96b, DL96a] and Chapter 6 of [DL01]), which in turn built closely on the work of [Yat85], but there are some small differences between our approach and theirs; the [DL01] approach uses a notion of the "competition" between two hypotheses which is not symmetric under swapping the two competing hypotheses, whereas our competition is symmetric. We obtain the following lemma, postponing all running-time analysis to the next section.

Lemma 8 There is an algorithm Choose-Hypothesis$^X(H_1, H_2, \epsilon', \delta')$ which is given oracle access to $X$, two hypothesis distributions $H_1, H_2$ for $X$, an accuracy parameter $\epsilon'$, and a confidence parameter $\delta'$. It makes $m = O(\log(1/\delta')/\epsilon'^2)$ draws from $X$ and returns some $H \in \{H_1, H_2\}$. If one of $H_1, H_2$ has $d_{\mathrm{TV}}(H_i, X) \le \epsilon'$ then with probability $1 - \delta'$ the $H$ that Choose-Hypothesis returns has $d_{\mathrm{TV}}(H, X) \le 6\epsilon'$.

Proof: Let $\mathcal{W}$ be the support of $X$. To set up the competition between $H_1$ and $H_2$, we define the following subset of $\mathcal{W}$:

$$\mathcal{W}_1 = \mathcal{W}_1(H_1, H_2) := \{w \in \mathcal{W} \mid H_1(w) > H_2(w)\}. \tag{5}$$

Let then $p_1 = H_1(\mathcal{W}_1)$ and $q_1 = H_2(\mathcal{W}_1)$. Clearly, $p_1 > q_1$ and $d_{\mathrm{TV}}(H_1, H_2) = p_1 - q_1$.
The competition between $H_1$ and $H_2$ is carried out as follows:

1. If $p_1 - q_1 \le 5\epsilon'$, declare a draw and return either $H_i$. Otherwise:

2. Draw $m = O\left(\frac{\log(1/\delta')}{\epsilon'^2}\right)$ samples $s_1, \ldots, s_m$ from $X$, and let $\tau = \frac{1}{m}|\{i \mid s_i \in \mathcal{W}_1\}|$ be the fraction of samples that fall inside $\mathcal{W}_1$.

3. If $\tau > p_1 - \frac{3}{2}\epsilon'$, declare $H_1$ as winner and return $H_1$; otherwise,

4. if $\tau < q_1 + \frac{3}{2}\epsilon'$, declare $H_2$ as winner and return $H_2$; otherwise,

5. declare a draw and return either $H_i$.

It is not hard to check that the outcome of the competition does not depend on the ordering of the pair of distributions provided in the input; that is, on inputs $(H_1, H_2)$ and $(H_2, H_1)$ the competition outputs the same result for a fixed sequence of samples $s_1, \ldots, s_m$ drawn from $X$.
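The five steps of the competition can be sketched as follows. This is a minimal illustration, not the paper's implementation: hypotheses are assumed to be given explicitly as dictionaries mapping points to probabilities, `draw_sample` is a hypothetical zero-argument callable returning one draw from $X$, and the constant in $m$ is assumed to be $2\ln(1/\delta')/\epsilon'^2$ (one concrete choice that makes $e^{-m\epsilon'^2/2} \le \delta'$).

```python
import math


def choose_hypothesis(h1, h2, eps, delta, draw_sample):
    """Run the competition between h1 and h2 (dicts: point -> probability).

    Returns a pair (verdict, hypothesis) with verdict in {"h1", "h2", "draw"}.
    On a draw the routine may return either hypothesis; this sketch returns h1.
    """
    # W_1: points where h1 puts strictly more mass than h2.
    w1 = {w for w, mass in h1.items() if mass > h2.get(w, 0.0)}
    p1 = sum(h1[w] for w in w1)
    q1 = sum(h2.get(w, 0.0) for w in w1)

    # Step 1: hypotheses too close to each other to separate.
    if p1 - q1 <= 5 * eps:
        return "draw", h1

    # Step 2: empirical mass of W_1 under the target X.
    m = math.ceil(2 * math.log(1 / delta) / eps ** 2)  # assumed constant
    hits = sum(1 for _ in range(m) if draw_sample() in w1)
    tau = hits / m

    # Steps 3-5: compare tau against the two thresholds.
    if tau > p1 - 1.5 * eps:
        return "h1", h1
    if tau < q1 + 1.5 * eps:
        return "h2", h2
    return "draw", h1
```

Because step 1 guarantees $p_1 - q_1 > 5\epsilon'$ before any samples are drawn, the two thresholds $p_1 - \frac{3}{2}\epsilon'$ and $q_1 + \frac{3}{2}\epsilon'$ are separated by more than $2\epsilon'$, which leaves room for the "draw" outcome in step 5.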

The correctness of Choose-Hypothesis is an immediate consequence of the following claim. (In fact for Lemma 8 we only need item (i) below, but item (ii) will be handy later in the proof of Lemma 11.)

Claim 9 Suppose that $d_{\mathrm{TV}}(X, H_1) \le \epsilon'$. Then:

(i) If $d_{\mathrm{TV}}(X, H_2) > 6\epsilon'$, then the probability that the competition between $H_1$ and $H_2$ does not declare $H_1$ as the winner is at most $e^{-m\epsilon'^2/2}$. (Intuitively, if $H_2$ is very bad then it is very likely that $H_1$ will be declared winner.)

(ii) If $d_{\mathrm{TV}}(X, H_2) > 4\epsilon'$, the probability that the competition between $H_1$ and $H_2$ declares $H_2$ as the winner is at most $e^{-m\epsilon'^2/2}$. (Intuitively, if $H_2$ is only moderately bad then a draw is possible but it is very unlikely that $H_2$ will be declared winner.)

Proof: Let $r = X(\mathcal{W}_1)$. The definition of the total variation distance implies that $|r - p_1| \le \epsilon'$. Let us define the 0/1 (indicator) random variables $\{Z_j\}_{j=1}^m$ as $Z_j = 1$ iff $s_j \in \mathcal{W}_1$. Clearly, $\tau = \frac{1}{m}\sum_{j=1}^m Z_j$ and $\mathbf{E}[\tau] = \mathbf{E}[Z_j] = r$. Since the $Z_j$'s are mutually independent, it follows from the Chernoff bound that $\Pr[\tau \le r - \epsilon'/2] \le e^{-m\epsilon'^2/2}$. Using $|r - p_1| \le \epsilon'$ we get that $\Pr[\tau \le p_1 - 3\epsilon'/2] \le e^{-m\epsilon'^2/2}$. Hence:

• For part (i): If $d_{\mathrm{TV}}(X, H_2) > 6\epsilon'$, from the triangle inequality we get that $p_1 - q_1 = d_{\mathrm{TV}}(H_1, H_2) > 5\epsilon'$. Hence, the algorithm will go beyond step 1, and with probability at least $1 - e^{-m\epsilon'^2/2}$, it will stop at step 3, declaring $H_1$ as the winner of the competition between $H_1$ and $H_2$.

• For part (ii): If $p_1 - q_1 \le 5\epsilon'$ then the competition declares a draw, hence $H_2$ is not the winner. Otherwise we have $p_1 - q_1 > 5\epsilon'$ and the above arguments imply that the competition between $H_1$ and $H_2$ will declare $H_2$ as the winner with probability at most $e^{-m\epsilon'^2/2}$.

This concludes the proof of Claim 9 and of Lemma 8. ■

2.4 Proof of Theorem 1

We first treat Part (1) of the theorem, where the learning algorithm may output any distribution over $[n]$ and not necessarily a PBD. Our algorithm has the structure outlined in Figure 1 with the following modifications: (a) if the target total variation distance is $\epsilon$, the second argument of both Learn-Sparse and Learn-Poisson is set to $\frac{\epsilon}{12\max\{c_1, c_2\}}$, where $c_1$ and $c_2$ are respectively the constants from Lemmas 3 and 5; (b) we replace the third step with Choose-Hypothesis$^X(H_S, \hat{H}_P, \epsilon/8, \delta/3)$, where $\hat{H}_P$ is defined in terms of $H_P$ as described below. If Choose-Hypothesis returns $H_S$, then Learn-PBD also returns $H_S$, while if Choose-Hypothesis returns $\hat{H}_P$, then Learn-PBD returns $H_P$. We proceed to the definition of $\hat{H}_P$.

Definition of $\hat{H}_P$: For every point $i$ where $H_S(i) = 0$, we let $\hat{H}_P(i) = H_P(i)$. For the points $i$ where $H_S(i) \ne 0$, in Theorem 7 of Section 6 we describe an efficient deterministic algorithm that numerically approximates $H_P(i)$ to within an additive $\pm\epsilon/(24s)$, where $s = O(1/\epsilon^3)$ is the cardinality of the support of $H_S$. We define $\hat{H}_P(i)$ to equal the approximation to $H_P(i)$ that is output by the algorithm of Theorem 7. Observe that $\hat{H}_P$ satisfies $d_{\mathrm{TV}}(\hat{H}_P, H_P) \le \epsilon/24$, and therefore $|d_{\mathrm{TV}}(\hat{H}_P, X) - d_{\mathrm{TV}}(X, H_P)| \le \epsilon/24$. In particular, if $d_{\mathrm{TV}}(X, H_P) \le \frac{\epsilon}{12}$, then $d_{\mathrm{TV}}(X, \hat{H}_P) \le \frac{\epsilon}{8}$, and if $d_{\mathrm{TV}}(X, \hat{H}_P) \le \frac{6\epsilon}{8}$, then $d_{\mathrm{TV}}(X, H_P) \le \epsilon$.
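The closeness of the approximated hypothesis to the translated Poisson hypothesis can be spelled out as a short calculation. The following is a sketch, assuming that total variation distance is read as half the $\ell_1$ distance even though the approximated values need not sum to exactly 1, and writing $\hat{H}_P$ for the approximation of $H_P$:

```latex
\begin{align*}
d_{\mathrm{TV}}(\hat{H}_P, H_P)
  &= \tfrac{1}{2}\sum_{i} \bigl|\hat{H}_P(i) - H_P(i)\bigr|
   = \tfrac{1}{2}\sum_{i \,:\, H_S(i) \neq 0} \bigl|\hat{H}_P(i) - H_P(i)\bigr| \\
  &\le \tfrac{1}{2} \cdot s \cdot \frac{\epsilon}{24 s}
   = \frac{\epsilon}{48} \le \frac{\epsilon}{24}, \\[4pt]
\frac{\epsilon}{12} + \frac{\epsilon}{24} &= \frac{\epsilon}{8},
\qquad
\frac{6\epsilon}{8} + \frac{\epsilon}{24} = \frac{19\epsilon}{24} \le \epsilon .
\end{align*}
```

The last line is the triangle-inequality arithmetic behind the two implications stated above.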

We do not use $H_P$ directly in Choose-Hypothesis because of computational considerations. Since $H_P$ is a translated Poisson distribution, we cannot compute its values $H_P(i)$ exactly, but using approximate values may cause Choose-Hypothesis to make a mistake. So we use $\hat{H}_P$ instead of $H_P$ in Choose-Hypothesis; $\hat{H}_P$ is carefully designed both to be close enough to $H_P$ so that Choose-Hypothesis will select a probability distribution close to the target $X$, and to allow efficient computation of all probabilities that Choose-Hypothesis needs without much overhead. In particular, we remark that in running Choose-Hypothesis we do not a priori compute the value of $\hat{H}_P$ at every point; we do instead a lazy evaluation of $\hat{H}_P$, as explained in the running-time analysis below.

We proceed now to the analysis of our modified algorithm Learn-PBD. The sample complexity bound and correctness of our algorithm are immediate consequences of Lemmas 3, 5 and 8, taking into account the precise choice of constants and the distance between $H_P$ and $\hat{H}_P$. To bound the running time, Lemmas 3 and 5 bound the running time of Steps 1 and 2 of the algorithm, so it remains to bound the running time of the Choose-Hypothesis step. Notice that $\mathcal{W}_1(H_S, \hat{H}_P)$ is a subset of the support of the distribution $H_S$. Hence to compute $\mathcal{W}_1(H_S, \hat{H}_P)$ it suffices to determine the probabilities $H_S(i)$ and $\hat{H}_P(i)$ for every point $i$ in the support of $H_S$. For every such $i$, $H_S(i)$ is explicitly given in the output of Learn-Sparse, so we only need to compute $\hat{H}_P(i)$. Theorem 7 implies that the time needed to compute $\hat{H}_P(i)$ is $\tilde{O}(\log^3(1/\epsilon) + \log n + |\hat{\mu}| + |\hat{\sigma}^2|)$, where $|\hat{\mu}|$ and $|\hat{\sigma}^2|$ are respectively the description complexities (bit lengths) of $\hat{\mu}$ and $\hat{\sigma}^2$. Since these parameters are output by Learn-Poisson, by inspection of that algorithm it is easy to see that they are each at most $O(\log n + \log\log(1/\delta) + \log(1/\epsilon))$. Hence, given that the support of $H_S$ has cardinality $O(1/\epsilon^3)$, the overall time spent computing all probabilities under $\hat{H}_P$ is $\tilde{O}\left(\frac{1}{\epsilon^3}\log n \log\frac{1}{\delta}\right)$. After $\mathcal{W}_1$ is computed, the computation of the values $p_1 = H_S(\mathcal{W}_1)$, $q_1 = \hat{H}_P(\mathcal{W}_1)$ and $p_1 - q_1$ takes time linear in the data produced by the algorithm so far, as these computations merely involve adding and subtracting probabilities that have already been explicitly computed by the algorithm. Computing the fraction of samples from $X$ that fall inside $\mathcal{W}_1$ takes time $O(\log n \cdot \log(1/\delta)/\epsilon^2)$ and the rest of Choose-Hypothesis takes time linear in the size of the data that have been written down so far. Hence the overall running time of our algorithm is $\tilde{O}\left(\frac{1}{\epsilon^3}\log n \log\frac{1}{\delta}\right)$. This gives Part (1) of Theorem 1.

Next we turn to Part (2) of Theorem 1, the proper learning result. We explain how to modify the algorithm of Part (1) to produce a PBD that is within $O(\epsilon)$ of the target $X$. We only need to add two post-processing steps converting $H_S$ and $H_P$ to PBDs; we describe and analyze these two steps below. For convenience we write $c$ to denote $\max\{c_1, c_2\} \ge 1$ in the following discussion.

1. Locate-Sparse($H_S$, $\frac{\epsilon}{12c}$): This routine searches the sparse-form PBDs inside the cover $\mathcal{S}_{\epsilon/12c}$ to identify a sparse-form PBD that is within distance $\frac{\epsilon}{6}$ from $H_S$, or outputs "fail" if it cannot find one. Note that if there is a sparse-form PBD $Y$ that is $\frac{\epsilon}{12c}$-close to $X$ and Learn-Sparse succeeds, then $Y$ must be $\frac{\epsilon}{6}$-close to $H_S$, since by Lemma 3 whenever Learn-Sparse succeeds the output distribution satisfies $d_{\mathrm{TV}}(X, H_S) \le \frac{\epsilon}{12}$. We show that if there is a sparse-form PBD $Y$ that is $\frac{\epsilon}{12c}$-close to $X$ and Learn-Sparse succeeds (an event that occurs with probability $1 - \delta/3$, see Lemma 3), our Locate-Sparse search routine, described below, will output a sparse-form PBD that is $\frac{\epsilon}{6}$-close to $H_S$. Indeed, given the preceding discussion, if we searched over all sparse-form PBDs inside the cover, it would be trivial to meet this guarantee. To save on computation time, we prune the set of sparse-form PBDs we search over, completing the entire search in time $\left(\frac{1}{\epsilon}\right)^{O(\log^2 1/\epsilon)} \log n \log 1/\delta$.

Here is a detailed explanation and run-time analysis of the improved search: First, note that the description complexity of $H_S$ is $\mathrm{poly}(1/\epsilon) \cdot \tilde{O}(\log n \log(1/\delta))$ as $H_S$ is output by an algorithm with this running time. Moreover, given a sparse-form PBD in $\mathcal{S}_{\epsilon/12c}$, we can compute all probabilities in the support of the distribution in time $\mathrm{poly}(1/\epsilon) \log n$. Indeed, by part (i) of Theorem 4 a sparse-form PBD has $O(1/\epsilon^3)$ non-trivial Bernoulli random variables and those each use probabilities $p_i$ that are integer multiples of some value which is $\Omega(\epsilon^2)$. So an easy dynamic programming algorithm can compute all probabilities in the support of the distribution in time $\mathrm{poly}(1/\epsilon) \log n$, where the $\log n$ overhead is due to the fact that the support of the distribution is some interval in $[n]$. Finally, we argue that we can restrict our search to only a small subset of the sparse-form PBDs in $\mathcal{S}_{\epsilon/12c}$. For this, we note that we can restrict our search to sparse-form PBDs whose support is a superset of the support of $H_S$. Indeed, the final statement of Lemma 3 implies that, if $Y$ is an arbitrary sparse-form PBD that is $\frac{\epsilon}{12c}$-close to $X$, then with probability $1 - \delta/3$ the output $H_S$ of Learn-Sparse will have support that is a subset of the support of $Y$. Given this, we only need to try $\left(\frac{1}{\epsilon}\right)^{O(\log^2 1/\epsilon)}$ sparse-form PBDs in the cover to find one that is close to $H_S$.

Hence, the overall running time of our search is $\left(\frac{1}{\epsilon}\right)^{O(\log^2 1/\epsilon)} \cdot \tilde{O}(\log n \log 1/\delta)$.

2. Locate-Binomial($\hat{\mu}, \hat{\sigma}^2, n$): This routine tries to compute a Binomial distribution that is $O(\epsilon)$-close to $H_P$ (recall that $H_P \equiv TP(\hat{\mu}, \hat{\sigma}^2)$). Analogous to Locate-Sparse, we will show that if $X$ is not $\frac{\epsilon}{12c}$-close to any sparse-form distribution inside $\mathcal{S}_{\epsilon/12c}$ and Learn-Poisson succeeds (for convenience we call these conditions our "working assumptions" in the following discussion), then the Binomial distribution output by our routine will be $O(\epsilon)$-close to $H_P$ and thus $O(\epsilon)$-close to $X$.

Let $\hat{\mu}$ and $\hat{\sigma}^2$ be the parameters output by Learn-Poisson, and let $\mu$ and $\sigma^2$ be the (unknown) mean and variance of the target $X$. Our routine has several steps. The first two steps eliminate corner-cases in the values $\hat{\mu}$ and $\hat{\sigma}^2$ computed by Learn-Poisson, while the last step defines a Binomial distribution $B(\hat{n}, \hat{p})$ with $\hat{n} \le n$ that is close to $H_P \equiv TP(\hat{\mu}, \hat{\sigma}^2)$ under our working assumptions. (We note that a significant portion of the work below is to ensure that $\hat{n} \le n$, which does not seem to follow from a more direct approach. Getting $\hat{n} \le n$ is necessary in order for our learning algorithm for order-$n$ PBDs to truly be proper.) Throughout (a), (b) and (c) below we assume that our working assumptions hold (note that this assumption is being used every time we employ results such as (1) or (2) from Section 2.2).

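(a) Tweaking $\hat{\sigma}^2$: If $\hat{\sigma}^2 \le \frac{n}{4}$ then set $\sigma_1^2 = \hat{\sigma}^2$, and otherwise set $\sigma_1^2 = \frac{n}{4}$. We note for future reference that in both cases Equation (2) gives

$$\sigma_1^2 \le (1 + O(\epsilon))\sigma^2. \tag{6}$$

We claim that this setting of $\sigma_1^2$ results in $d_{\mathrm{TV}}(TP(\hat{\mu}, \hat{\sigma}^2), TP(\hat{\mu}, \sigma_1^2)) \le O(\epsilon)$. If $\hat{\sigma}^2 \le \frac{n}{4}$ then this variation distance is zero and the claim certainly holds. Otherwise we have the following (see Equation (2)):

$$(1 + O(\epsilon))\sigma^2 \ge \hat{\sigma}^2 > \sigma_1^2 = \frac{n}{4} \ge \sum_{i=1}^n p_i(1 - p_i) = \sigma^2.$$

Hence, by Lemma 2 we get:

$$d_{\mathrm{TV}}(TP(\hat{\mu}, \hat{\sigma}^2), TP(\hat{\mu}, \sigma_1^2)) \le \frac{|\hat{\sigma}^2 - \sigma_1^2| + 1}{\sigma_1^2} \le \frac{O(\epsilon)\sigma^2 + 1}{\sigma^2} = O(\epsilon), \tag{7}$$

where we used the fact that $\sigma^2 = \Omega(1/\epsilon^2)$ (see (1)).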

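(b) Tweaking $\sigma_1^2$: If $\hat{\mu}^2 \le n(\hat{\mu} - \sigma_1^2)$ then set $\sigma_2^2 = \sigma_1^2$, and otherwise set $\sigma_2^2 = \frac{n\hat{\mu} - \hat{\mu}^2}{n}$. We claim that this results in $d_{\mathrm{TV}}(TP(\hat{\mu}, \sigma_1^2), TP(\hat{\mu}, \sigma_2^2)) \le O(\epsilon)$. If $\hat{\mu}^2 \le n(\hat{\mu} - \sigma_1^2)$ then as before the variation distance is zero and the claim holds. Otherwise, we observe that $\sigma_1^2 > \sigma_2^2$ and $\sigma_2^2 \ge 0$ (the last assertion follows from the fact that $\hat{\mu}$ must be at most $n$). So we have (see (2)) that

$$|\mu - \hat{\mu}| \le O(\epsilon)\sigma \le O(\epsilon)\mu, \tag{8}$$

which implies

$$n - \hat{\mu} \ge n - \mu - O(\epsilon)\sigma. \tag{9}$$

We now observe that

$$\mu^2 = \Bigl(\sum_i p_i\Bigr)^2 \le n\Bigl(\sum_i p_i^2\Bigr) = n(\mu - \sigma^2),$$

where the inequality is Cauchy-Schwarz. Rearranging this yields

$$\frac{\mu(n - \mu)}{n} \ge \sigma^2. \tag{10}$$

We now have that

$$\sigma_2^2 = \frac{\hat{\mu}(n - \hat{\mu})}{n} \ge \frac{(1 - O(\epsilon))\mu(n - \mu - O(\epsilon)\sigma)}{n} \ge (1 - O(\epsilon))\sigma^2 - O(\epsilon)\sigma, \tag{11}$$

where the first inequality follows from (8) and (9) and the second follows from (10) and the fact that any PBD over $n$ variables satisfies $\mu \le n$. Hence, by Lemma 2 we get:

$$d_{\mathrm{TV}}(TP(\hat{\mu}, \sigma_1^2), TP(\hat{\mu}, \sigma_2^2)) \le \frac{\sigma_1^2 - \sigma_2^2 + 1}{\sigma_2^2} \le \frac{(1 + O(\epsilon))\sigma^2 - (1 - O(\epsilon))\sigma^2 + O(\epsilon)\sigma + 1}{(1 - O(\epsilon))\sigma^2 - O(\epsilon)\sigma}$$
$$\le \frac{O(\epsilon)\sigma^2}{(1 - O(\epsilon))\sigma^2} = O(\epsilon), \tag{12}$$

where we used (6), (11) and the bound $\sigma^2 = \Omega(1/\epsilon^2)$ (see (1)).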

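(c) Constructing a Binomial Distribution: We construct a Binomial distribution $H_B$ that is $O(\epsilon)$-close to $TP(\hat{\mu}, \sigma_2^2)$. If we do this then we have $d_{\mathrm{TV}}(H_B, H_P) = O(\epsilon)$ by (7), (12) and the triangle inequality. The Binomial distribution $H_B$ we construct is $\mathrm{Bin}(\hat{n}, \hat{p})$, where:

$$\hat{n} = \left\lfloor \frac{\hat{\mu}^2}{\hat{\mu} - \sigma_2^2} \right\rfloor \quad\text{and}\quad \hat{p} = \frac{\hat{\mu} - \sigma_2^2}{\hat{\mu}}.$$

Note that by the way $\sigma_2^2$ is set in step (b) above we indeed have $\hat{n} \le n$ as claimed in Part 2 of Theorem 1.

Let us bound the total variation distance between $\mathrm{Bin}(\hat{n}, \hat{p})$ and $TP(\hat{\mu}, \sigma_2^2)$. Using Lemma 1 we have:

$$d_{\mathrm{TV}}(\mathrm{Bin}(\hat{n}, \hat{p}), TP(\hat{n}\hat{p}, \hat{n}\hat{p}(1 - \hat{p}))) \le \frac{1}{\sqrt{\hat{n}\hat{p}(1 - \hat{p})}} + \frac{2}{\hat{n}\hat{p}(1 - \hat{p})}. \tag{13}$$

Notice that

$$\hat{n}\hat{p}(1 - \hat{p}) \ge \left(\frac{\hat{\mu}^2}{\hat{\mu} - \sigma_2^2} - 1\right)\left(\frac{\hat{\mu} - \sigma_2^2}{\hat{\mu}}\right)\left(\frac{\sigma_2^2}{\hat{\mu}}\right) = \sigma_2^2 - \hat{p}(1 - \hat{p}) \ge (1 - O(\epsilon))\sigma^2 - 1 = \Omega(1/\epsilon^2),$$

where the next-to-last step used (11) and the last used the fact that $\sigma^2 = \Omega(1/\epsilon^2)$ (see (1)). So plugging this into (13) we get:

$$d_{\mathrm{TV}}(\mathrm{Bin}(\hat{n}, \hat{p}), TP(\hat{n}\hat{p}, \hat{n}\hat{p}(1 - \hat{p}))) = O(\epsilon).$$

The next step is to compare $TP(\hat{n}\hat{p}, \hat{n}\hat{p}(1 - \hat{p}))$ and $TP(\hat{\mu}, \sigma_2^2)$. Lemma 2 gives:

$$d_{\mathrm{TV}}(TP(\hat{n}\hat{p}, \hat{n}\hat{p}(1 - \hat{p})), TP(\hat{\mu}, \sigma_2^2)) \le \frac{|\hat{n}\hat{p} - \hat{\mu}|}{\min\bigl(\sqrt{\hat{n}\hat{p}(1 - \hat{p})}, \sigma_2\bigr)} + \frac{|\hat{n}\hat{p}(1 - \hat{p}) - \sigma_2^2| + 1}{\min\bigl(\hat{n}\hat{p}(1 - \hat{p}), \sigma_2^2\bigr)}$$
$$\le \frac{1}{\sqrt{\hat{n}\hat{p}(1 - \hat{p})}} + \frac{2}{\hat{n}\hat{p}(1 - \hat{p})} = O(\epsilon).$$

By the triangle inequality we get $d_{\mathrm{TV}}(\mathrm{Bin}(\hat{n}, \hat{p}), TP(\hat{\mu}, \sigma_2^2)) = O(\epsilon)$, which was our ultimate goal.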
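Steps (a), (b) and (c) of Locate-Binomial amount to a short parameter computation, sketched below. The function name `locate_binomial` is hypothetical, and exact rational arithmetic via `fractions.Fraction` is an assumption of this sketch (chosen so that the floor in step (c) is not affected by floating-point rounding); the paper does not prescribe a number representation.

```python
import math
from fractions import Fraction


def locate_binomial(mu_hat, var_hat, n):
    """Return (n_hat, p_hat) defining Bin(n_hat, p_hat) from TP(mu_hat, var_hat).

    Assumes mu_hat > 0 and mu_hat <= n, as in the text.
    """
    mu = Fraction(mu_hat)
    var = Fraction(var_hat)

    # Step (a): cap the variance at n / 4.
    var1 = var if var <= Fraction(n, 4) else Fraction(n, 4)

    # Step (b): make sure mu^2 / (mu - variance) does not exceed n.
    if mu ** 2 <= n * (mu - var1):
        var2 = var1
    else:
        var2 = (n * mu - mu ** 2) / n

    # Step (c): Binomial parameters with mean ~ mu and variance ~ var2.
    n_hat = math.floor(mu ** 2 / (mu - var2))
    p_hat = (mu - var2) / mu
    return n_hat, p_hat
```

In the second branch of step (b) one has $\hat{\mu} - \sigma_2^2 = \hat{\mu}^2/n$, so the ratio in step (c) equals exactly $n$; this is the case that forces $\hat{n} \le n$.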

Given the above Locate-Sparse and Locate-Binomial routines, the algorithm Proper-Learn-PBD has the following structure: It first runs Learn-PBD with accuracy parameters $\epsilon, \delta$. If Learn-PBD returns the distribution $H_S$ computed by subroutine Learn-Sparse, then Proper-Learn-PBD outputs the result of Locate-Sparse($H_S$, $\frac{\epsilon}{12c}$). If, on the other hand, Learn-PBD returns the translated Poisson distribution $H_P = TP(\hat{\mu}, \hat{\sigma}^2)$ computed by subroutine Learn-Poisson, then Proper-Learn-PBD returns the Binomial distribution constructed by the routine Locate-Binomial($\hat{\mu}, \hat{\sigma}^2, n$). It follows from the correctness of Learn-PBD and the above discussion that, with probability $1 - \delta$, the output of Proper-Learn-PBD is within total variation distance $O(\epsilon)$ of the target $X$. The number of samples is the same as in Learn-PBD, and the running time is $\left(\frac{1}{\epsilon}\right)^{O(\log^2 1/\epsilon)} \cdot \tilde{O}(\log n \log 1/\delta)$.

This concludes the proof of Part 2 of Theorem 1, and thus of the entire theorem.
3 Learning weighted sums of independent Bernoullis

In this section we consider a generalization of the problem of learning an unknown PBD, by studying the learnability of weighted sums of independent Bernoulli random variables $X = \sum_{i=1}^n w_i X_i$. (Throughout this section we assume for simplicity that the weights are "known" to the learning algorithm.) In Section 3.1 we show that if there are only constantly many different weights then such distributions can be learned by an algorithm that uses $O(\log n)$ samples and runs in time $\mathrm{poly}(n)$. In Section 3.2 we show that if there are $n$ distinct weights then even if those weights have an extremely simple structure - the $i$-th weight is simply $i$ - any algorithm must use $\Omega(n)$ samples.

3.1 Learning sums of weighted independent Bernoulli random variables with few distinct weights 

Recall Theorem 2.

Theorem 2 Let $X = \sum_{i=1}^n a_i X_i$ be a weighted sum of unknown independent Bernoulli random variables such that there are at most $k$ different values in the set $\{a_1, \ldots, a_n\}$. Then there is an algorithm with the following properties: given $n, a_1, \ldots, a_n$ and access to independent draws from $X$, it uses $\log(n) \cdot \tilde{O}(k \cdot \epsilon^{-2}) \cdot \log(1/\delta)$ samples from the target distribution $X$, runs in time $\mathrm{poly}\bigl(n^k \cdot (k/\epsilon)^{k\log^2(k/\epsilon)}\bigr) \cdot \log(1/\delta)$, and with probability $1 - \delta$ outputs a hypothesis vector $\hat{p} \in [0, 1]^n$ defining independent Bernoulli random variables $\hat{X}_i$ with $\mathbf{E}[\hat{X}_i] = \hat{p}_i$ such that $d_{\mathrm{TV}}(\hat{X}, X) \le \epsilon$, where $\hat{X} = \sum_{i=1}^n a_i \hat{X}_i$.

Given a vector $a = (a_1, \ldots, a_n)$ of weights, we refer to a distribution $X = \sum_{i=1}^n a_i X_i$ (where $X_1, \ldots, X_n$ are independent Bernoullis which may have arbitrary means) as an $a$-weighted sum of Bernoullis, and we write $\mathcal{S}_a$ to denote the space of all such distributions.

To prove Theorem 2 we first show that $\mathcal{S}_a$ has an $\epsilon$-cover that is not too large. We then show that by running a "tournament" between all pairs of distributions in the cover, using the hypothesis testing subroutine from Section 2.3, it is possible to identify a distribution in the cover that is close to the target $a$-weighted sum of Bernoullis.

Lemma 10 There is an $\epsilon$-cover $\mathcal{S}_{a,\epsilon} \subset \mathcal{S}_a$ of size $|\mathcal{S}_{a,\epsilon}| \le (n/k)^{3k} \cdot (k/\epsilon)^{k \cdot O(\log^2(k/\epsilon))}$ that can be constructed in time $\mathrm{poly}(|\mathcal{S}_{a,\epsilon}|)$.



Proof: Let $\{b_j\}_{j=1}^k$ denote the set of distinct weights in $a_1, \ldots, a_n$, and let $n_j = |\{i \in [n] \mid a_i = b_j\}|$. With this notation, we can write $X = \sum_{j=1}^k b_j S_j = g(S)$, where $S = (S_1, \ldots, S_k)$ with each $S_j$ a sum of $n_j$ many independent Bernoulli random variables and $g(y_1, \ldots, y_k) = \sum_{j=1}^k b_j y_j$. Clearly we have $\sum_{j=1}^k n_j = n$. By Theorem 4, for each $j \in \{1, \ldots, k\}$ the space of all possible $S_j$'s has an explicit $(\epsilon/k)$-cover $\mathcal{S}^j_{\epsilon/k}$ of size $|\mathcal{S}^j_{\epsilon/k}| \le n_j^3 \cdot O(k/\epsilon) + n \cdot (k/\epsilon)^{O(\log^2(k/\epsilon))}$. By independence across $S_j$'s, the product $\mathcal{Q} = \prod_{j=1}^k \mathcal{S}^j_{\epsilon/k}$ is an $\epsilon$-cover for the space of all possible $S$'s, and hence the set

$$\Bigl\{\, \sum_{j=1}^k b_j S_j : (S_1, \ldots, S_k) \in \mathcal{Q} \,\Bigr\}$$

is an $\epsilon$-cover for $\mathcal{S}_a$. So $\mathcal{S}_a$ has an explicit $\epsilon$-cover of size $|\mathcal{Q}| = \prod_{j=1}^k |\mathcal{S}^j_{\epsilon/k}| \le (n/k)^{3k} \cdot (k/\epsilon)^{k \cdot O(\log^2(k/\epsilon))}$.

(We note that a slightly stronger quantitative bound on the cover size can be obtained using Theorem 6 instead of Theorem 4, but the improvement is negligible for our ultimate purposes.) ■

Lemma 11 Let $\mathcal{S}$ be any collection of distributions over a finite set. Suppose that $\mathcal{S}_\epsilon \subset \mathcal{S}$ is an $\epsilon$-cover of $\mathcal{S}$ of size $N$. Then there is an algorithm that uses $O(\epsilon^{-2} \log N \log(1/\delta))$ samples from an unknown target distribution $X \in \mathcal{S}$ and with probability $1 - \delta$ outputs a distribution $Z \in \mathcal{S}_\epsilon$ that satisfies $d_{\mathrm{TV}}(X, Z) \le 6\epsilon$.

Devroye and Lugosi (Chapter 7 of [DL01]) prove a similar result by having all pairs of distributions in the cover compete against each other using their notion of a competition, but again there are some small differences: their approach chooses a distribution in the cover which wins the maximum number of competitions, whereas our algorithm chooses a distribution that is never defeated (i.e. won or achieved a draw against all other distributions in the cover).

Proof: The algorithm performs a tournament by running the competition Choose-Hypothesis$^X(H_i, H_j, \epsilon, \delta/(2N))$ for every pair of distinct distributions $H_i, H_j$ in the cover $\mathcal{S}_\epsilon$. It outputs a distribution $Y^* \in \mathcal{S}_\epsilon$ that was never a loser (i.e. won or achieved a draw in all its competitions). If no such distribution exists in $\mathcal{S}_\epsilon$ then the algorithm outputs "failure."

Since $\mathcal{S}_\epsilon$ is an $\epsilon$-cover of $\mathcal{S}$, there exists some $Y \in \mathcal{S}_\epsilon$ such that $d_{\mathrm{TV}}(X, Y) \le \epsilon$. We first argue that with high probability this distribution $Y$ never loses a competition against any other $Y' \in \mathcal{S}_\epsilon$ (so the algorithm does not output "failure"). Consider any $Y' \in \mathcal{S}_\epsilon$. If $d_{\mathrm{TV}}(X, Y') > 4\epsilon$, by Claim 9(ii) the probability that $Y$ loses to $Y'$ is at most $2e^{-m\epsilon^2/2} = O(1/N)$. On the other hand, if $d_{\mathrm{TV}}(X, Y') \le 4\epsilon$, the triangle inequality gives that $d_{\mathrm{TV}}(Y, Y') \le 5\epsilon$ and thus $Y$ draws against $Y'$. A union bound over all $N$ distributions in $\mathcal{S}_\epsilon$ shows that with probability $1 - \delta/2$, the distribution $Y$ never loses a competition.

We next argue that with probability at least $1 - \delta/2$, every distribution $Y' \in \mathcal{S}_\epsilon$ that never loses has $Y'$ close to $X$. Fix a distribution $Y'$ such that $d_{\mathrm{TV}}(Y', X) > 6\epsilon$; Claim 9(i) implies that $Y'$ loses to $Y$ with probability $1 - 2e^{-m\epsilon^2/2} \ge 1 - \delta/(2N)$. A union bound gives that with probability $1 - \delta/2$, every distribution $Y'$ that has $d_{\mathrm{TV}}(Y', X) > 6\epsilon$ loses some competition.

Thus, with overall probability at least $1 - \delta$, the tournament does not output "failure" and outputs some distribution $Y^*$ such that $d_{\mathrm{TV}}(X, Y^*)$ is at most $6\epsilon$. This proves the lemma. ■
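The tournament in the proof can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: cover elements are explicit dictionaries (point to probability), `draw_sample` is a hypothetical callable returning one draw from $X$, fresh samples are drawn for every competition, each unordered pair competes once, and the sample-size constant $2\ln(1/\delta')/\epsilon^2$ is an assumed concrete choice.

```python
import itertools
import math


def compete(h1, h2, eps, delta, draw_sample):
    """Return 1 if h1 wins, 2 if h2 wins, and 0 for a draw."""
    w1 = {w for w, mass in h1.items() if mass > h2.get(w, 0.0)}
    p1 = sum(h1[w] for w in w1)
    q1 = sum(h2.get(w, 0.0) for w in w1)
    if p1 - q1 <= 5 * eps:
        return 0
    m = math.ceil(2 * math.log(1 / delta) / eps ** 2)  # assumed constant
    tau = sum(1 for _ in range(m) if draw_sample() in w1) / m
    if tau > p1 - 1.5 * eps:
        return 1
    if tau < q1 + 1.5 * eps:
        return 2
    return 0


def tournament(cover, eps, delta, draw_sample):
    """Return a cover element that never lost a competition, or None ("failure")."""
    size = len(cover)
    lost = [False] * size
    for i, j in itertools.combinations(range(size), 2):
        outcome = compete(cover[i], cover[j], eps, delta / (2 * size), draw_sample)
        if outcome == 1:
            lost[j] = True
        elif outcome == 2:
            lost[i] = True
    for i in range(size):
        if not lost[i]:
            return cover[i]
    return None
```

Selecting a never-defeated element, rather than the element with the most wins, is the difference from the Devroye-Lugosi approach noted above; the number of competitions is quadratic in the cover size, which is where the $\mathrm{poly}(|\mathcal{S}_{a,\epsilon}|)$ running time comes from.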

Proof of Theorem 2: We claim that the algorithm of Lemma 11 has the desired sample complexity and can be implemented to run in the claimed time bound. The sample complexity bound follows directly from Lemma 11. It remains to argue about the time complexity. Note that the running time of the algorithm is $\mathrm{poly}(|\mathcal{S}_{a,\epsilon}|)$ times the running time of a competition. We will show that a competition between $H_1, H_2 \in \mathcal{S}_{a,\epsilon}$ can be carried out by an efficient algorithm. This amounts to efficiently computing the probabilities $p_1 = H_1(\mathcal{W}_1)$ and $q_1 = H_2(\mathcal{W}_1)$. Note that $\mathcal{W} = \sum_{j=1}^k b_j \cdot \{0, 1, \ldots, n_j\}$. Clearly, $|\mathcal{W}| \le \prod_{j=1}^k (n_j + 1) = O((n/k)^k)$. It is thus easy to see that $p_1, q_1$ can be efficiently computed as long as there is an efficient algorithm for the following problem: given $H = \sum_{j=1}^k b_j S_j \in \mathcal{S}_{a,\epsilon}$ and $w \in \mathcal{W}$, compute $H(w)$. Indeed, fix any such $H, w$. We have that

$$H(w) = \sum_{m_1, \ldots, m_k} \prod_{j=1}^k \Pr_H[S_j = m_j],$$

where the sum is over all $k$-tuples $(m_1, \ldots, m_k)$ such that $0 \le m_j \le n_j$ for all $j$ and $b_1 m_1 + \cdots + b_k m_k = w$ (as noted above there are at most $O((n/k)^k)$ such $k$-tuples). To complete the proof of Theorem 2 we note that $\Pr_H[S_j = m_j]$ can be computed in $O(n_j^2)$ time by standard dynamic programming. ■
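The dynamic program and the sum over $k$-tuples mentioned in the proof can be sketched as below. The function names `pbd_pmf` and `weighted_sum_pmf` are hypothetical, floating-point arithmetic is used for brevity, and the brute-force enumeration of tuples is only meant to mirror the formula for $H(w)$, not to be efficient.

```python
import itertools
import math
from collections import defaultdict


def pbd_pmf(probs):
    """pmf of a sum of independent Bernoullis with the given means.

    Standard O(n^2) dynamic program: pmf[m] = Pr[sum of the first i variables = m].
    """
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for m, mass in enumerate(pmf):
            nxt[m] += mass * (1 - p)   # the new variable is 0
            nxt[m + 1] += mass * p     # the new variable is 1
        pmf = nxt
    return pmf


def weighted_sum_pmf(weights, groups):
    """pmf of H = sum_j b_j * S_j, where S_j is a PBD with means groups[j].

    Enumerates all tuples (m_1, ..., m_k) and adds prod_j Pr[S_j = m_j]
    to the point w = sum_j b_j * m_j.
    """
    pmfs = [pbd_pmf(g) for g in groups]
    h = defaultdict(float)
    for ms in itertools.product(*(range(len(f)) for f in pmfs)):
        w = sum(b * m for b, m in zip(weights, ms))
        h[w] += math.prod(f[m] for f, m in zip(pmfs, ms))
    return dict(h)
```

For example, `weighted_sum_pmf([1, 2], [[0.5], [0.5]])` describes $X_1 + 2X_2$ for two fair coins, which is uniform on $\{0, 1, 2, 3\}$.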

We close this subsection with the following remark: In recent work [DDS11] the authors have given a $\mathrm{poly}(\ell, \log(n), 1/\epsilon)$-time algorithm that learns any $\ell$-modal distribution over $[n]$ (i.e. a distribution whose pdf has at most $\ell$ "peaks" and "valleys") using $O(\ell\log(n)/\epsilon^3 + (\ell/\epsilon)^3 \log(\ell/\epsilon) \log\log(\ell/\epsilon))$ samples. It is natural to wonder whether this algorithm could be used to efficiently learn a sum of $n$ weighted independent Bernoulli random variables with $k$ distinct weights, and thus give an alternate algorithm for Theorem 2, perhaps with better asymptotic guarantees. However, it is easy to construct a sum $X = \sum_{i=1}^n a_i X_i$ of $n$ weighted independent Bernoulli random variables with $k$ distinct weights such that $X$ is $2^k$-modal. Thus a naive application of the [DDS11] result would only give an algorithm with sample complexity exponential in $k$, rather than the quasilinear sample complexity of our current algorithm. If the $2^k$-modality of the above-mentioned example is the worst case (which we do not know), then the [DDS11] algorithm would give a $\mathrm{poly}(2^k, \log(n), 1/\epsilon)$-time algorithm for our problem that uses $O(2^k \log(n)/\epsilon^3) + 2^{O(k)} \cdot \tilde{O}(1/\epsilon^3)$ examples (so comparing with Theorem 2, exponentially worse sample complexity as a function of $k$, but exponentially better running time as a function of $n$). Finally, in the context of this question (how many modes can there be for a sum of $n$ weighted independent Bernoulli random variables with $k$ distinct weights), it is interesting to recall the result of K.-I. Sato [Sat93] which shows that for any $N$ there are two unimodal distributions $X, Y$ such that $X + Y$ has at least $N$ modes.

3.2 Sample complexity lower bound for learning sums of weighted independent Bernoullis

Recall Theorem 3.

Theorem 3 Let $X = \sum_{i=1}^n i \cdot X_i$ be a weighted sum of unknown independent Bernoulli random variables (where the $i$-th weight is simply $i$). Let $L$ be any learning algorithm which, given $n$ and access to independent draws from $X$, outputs a hypothesis distribution $\hat{X}$ such that $d_{\mathrm{TV}}(\hat{X}, X) \le 1/25$ with probability at least $e^{-o(n)}$. Then $L$ must use $\Omega(n)$ samples.

Proof of Theorem 3: We define a probability distribution over possible target probability distributions $X$ as follows: A subset $S \subset \{n/2 + 1, \ldots, n\}$ of size $|S| = n/100$ is drawn uniformly at random from all $\binom{n/2}{n/100}$ possible outcomes. The vector $p = (p_1, \ldots, p_n)$ is defined as follows: for each $i \in S$ the value $p_i$ equals $100/n = 1/|S|$, and for all other $i$ the value $p_i$ equals 0. The $i$-th Bernoulli random variable $X_i$ has $\mathbf{E}[X_i] = p_i$, and the target distribution is $X = X_p = \sum_{i=1}^n i X_i$.
We will need two easy lemmas:

Lemma 12 Fix any $S, p$ as described above. For any $j \in \{n/2 + 1, \ldots, n\}$ we have $X_p(j) \ne 0$ if and only if $j \in S$. For any $j \in S$ the value $X_p(j)$ is exactly $(100/n)(1 - 100/n)^{n/100 - 1} > 35/n$ (for $n$ sufficiently large), and hence $X_p(\{n/2 + 1, \ldots, n\}) > 0.35$ (again for $n$ sufficiently large).

The first claim of the lemma holds because any set of $c \ge 2$ numbers from $\{n/2 + 1, \ldots, n\}$ must sum to more than $n$. The second claim holds because the only way a draw $x$ from $X_p$ can have $x = j$ is if $X_j = 1$ and all other $X_i$ are 0 (here we are using $\lim_{x \to \infty}(1 - 1/x)^x = 1/e$).
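The constant in Lemma 12 can be checked numerically. The sketch below evaluates the exact point mass $(100/n)(1 - 100/n)^{n/100 - 1}$ for a given $n$; the function name `lemma12_point_mass` is hypothetical and $n$ is assumed to be a multiple of 100.

```python
def lemma12_point_mass(n):
    """Exact value of X_p(j) for j in S: one chosen variable is 1, the other |S| - 1 are 0."""
    size_s = n // 100          # |S| = n / 100
    p = 100 / n                # each variable in S is 1 with probability 100 / n = 1 / |S|
    return p * (1 - p) ** (size_s - 1)
```

Since $(1 - 1/|S|)^{|S| - 1}$ decreases to $1/e \approx 0.368$, the point mass stays above $35/n$, and summing over the $n/100$ elements of $S$ gives total mass above $0.35$.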
The next lemma is an easy consequence of Chernoff bounds:

Lemma 13 Fix any $p$ as defined above, and consider a sequence of $n/2000$ independent draws from $X_p = \sum_i i X_i$. With probability $1 - e^{-\Omega(n)}$ the total number of indices $j \in [n]$ such that $X_j$ is ever 1 in any of the $n/2000$ draws is at most $n/1000$.

We are now ready to prove Theorem 3. Let $L$ be a learning algorithm that receives $n/2000$ samples. Let $S \subset \{n/2 + 1, \ldots, n\}$ and $p$ be chosen randomly as defined above, and set the target to $X = X_p$.

We consider an augmented learner $L'$ that is given "extra information." For each point in the sample, instead of receiving the value of that draw from $X$ the learner $L'$ is given the entire vector $(X_1, \ldots, X_n) \in \{0, 1\}^n$. Let $T$ denote the set of elements $j \in \{n/2 + 1, \ldots, n\}$ for which the learner is ever given a vector $(X_1, \ldots, X_n)$ that has $X_j = 1$. By Lemma 13 we have $|T| \le n/1000$ with probability at least $1 - e^{-\Omega(n)}$; we condition on the event $|T| \le n/1000$ going forth.

Fix any value $\ell \le n/1000$. Conditioned on $|T| = \ell$, the set $T$ is equally likely to be any $\ell$-element subset of $S$, and all possible "completions" of $T$ with an additional $n/100 - \ell \ge 9n/1000$ elements of $\{n/2 + 1, \ldots, n\} \setminus T$ are equally likely to be the true set $S$.

Let $H$ denote the hypothesis distribution over $[n]$ that algorithm $L$ outputs. Let $R$ denote the set $\{n/2 + 1, \ldots, n\} \setminus T$; note that since $|T| = \ell \le n/1000$, we have $|R| \ge 499n/1000$. Let $U$ denote the set $\{i \in R : H(i) \ge 30/n\}$. Since $H$ is a distribution we must have $|U| \le n/30$. Each element in $S \setminus U$ "costs" at least $5/n$ in variation distance between $X$ and $H$. Since $S$ is a uniform random extension of $T$ with at most $n/100 - \ell \in [9n/1000, n/100]$ unknown elements of $R$ and $|R| \ge 499n/1000$, an easy calculation shows that $\Pr[|S \setminus U| > 8n/1000]$ is $1 - e^{-\Omega(n)}$. This means that with probability $1 - e^{-\Omega(n)}$ we have $d_{\mathrm{TV}}(X, H) \ge \frac{8n}{1000} \cdot \frac{5}{n} = 1/25$, and the theorem is proved. ■
4 Extensions of the Cover Theorem of [DP11]

4.1 Proof of Theorem 4



We only need to argue that the $\epsilon$-covers constructed in [Das08] and [DP11] satisfy the part of the theorem following "finally;" we will refer to this part of the theorem as the last part in the following discussion. Moreover, in order to avoid reproducing here the involved constructions of [Das08] and [DP11], we will assume that the reader has some familiarity with these constructions. Nevertheless, we will try to make our proof self-contained.

First, we claim that we only need to establish the last part of Theorem 4 for the cover obtained in [Das08]. Indeed, the $\epsilon$-cover of [DP11] is just a subset of the $\epsilon/2$-cover of [Das08], which includes only a subset of the sparse form distributions in the $\epsilon/2$-cover of [Das08]. Moreover, for every sparse form distribution in the $\epsilon/2$-cover of [Das08], the $\epsilon$-cover of [DP11] includes at least one sparse form distribution that is $\epsilon/2$-close in total variation distance. Hence, if the $\epsilon/2$-cover of [Das08] satisfies the last part of Theorem 4, it follows that the $\epsilon$-cover of [DP11] also satisfies the last part of Theorem 4.

We proceed to argue that the cover of [Das08] satisfies the last part of Theorem 4. The construction of the $\epsilon$-cover in [Das08] works roughly as follows: Given an arbitrary collection of indicators $\{X_i\}_{i=1}^n$ with expectations $\mathbf{E}[X_i] = p_i$ for all $i$, the collection is subjected to two filters, called the Stage 1 and the Stage 2 filters (see respectively Sections 5 and 6 of [Das08]). Using the same notation as [Das08] let us denote by $\{Z_i\}_i$ the collection output by the Stage 1 filter and by $\{Y_i\}_i$ the collection output by the Stage 2 filter. The collection output by the Stage 2 filter is included in the $\epsilon$-cover of [Das08], satisfies that $d_{\mathrm{TV}}(\sum_i X_i, \sum_i Y_i) \le \epsilon$, and is in either the heavy Binomial or the sparse form.



Let $(\mu_Z, \sigma_Z^2)$ and $(\mu_Y, \sigma_Y^2)$ denote respectively the (mean, variance) pairs of the variables $Z = \sum_i Z_i$ and $Y = \sum_i Y_i$. We argue first that the pair $(\mu_Z, \sigma_Z^2)$ satisfies $|\mu - \mu_Z| = O(\epsilon)$ and $|\sigma^2 - \sigma_Z^2| = O(\epsilon \cdot (1 + \sigma^2))$, where $\mu$ and $\sigma^2$ are respectively the mean and variance of $X = \sum_i X_i$. Next we argue that, if the collection $\{Y_i\}_i$ output by the Stage 2 filter is in heavy Binomial form, then $(\mu_Y, \sigma_Y^2)$ also satisfies $|\mu - \mu_Y| = O(\epsilon)$ and $|\sigma^2 - \sigma_Y^2| = O(1 + \epsilon \cdot (1 + \sigma^2))$.

• Proof for $(\mu_Z, \sigma_Z^2)$: The Stage 1 filter only modifies the indicators $X_i$ with $p_i \in (0, 1/k) \cup (1 - 1/k, 1)$,
for some well-chosen $k = O(1/\epsilon)$ (as in the statement of Theorem 4). For convenience let us define
$\mathcal{L} = \{i \mid p_i \in (0, 1/k)\}$ and $\mathcal{H} = \{i \mid p_i \in (1 - 1/k, 1)\}$ as in [Das08]. The filter of Stage 1 rounds the
expectations of the indicators indexed by $\mathcal{L}$ to some value in $\{0, 1/k\}$ so that no expectation is altered by
more than an additive $1/k$, and the sum of these expectations is not modified by more than an additive
$1/k$. Similarly, the expectations of the indicators indexed by $\mathcal{H}$ are rounded to some value in $\{1 - 1/k, 1\}$.
See the details of how the rounding is performed in Section 5 of [Das08]. Let us then denote by $\{p'_i\}_i$ the
expectations of the indicators $\{Z_i\}_i$ resulting from the rounding. We argue that the mean and variance of
$Z = \sum_i Z_i$ are close to the mean and variance of $X$. Indeed,



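$$|\mu - \mu_Z| = \left|\sum_i p_i - \sum_i p'_i\right| = \left|\sum_{i \in \mathcal{L} \cup \mathcal{H}} p_i - \sum_{i \in \mathcal{L} \cup \mathcal{H}} p'_i\right| \le O(1/k) = O(\epsilon). \tag{14}$$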



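Similarly,

$$|\sigma^2 - \sigma_Z^2| = \left|\sum_i p_i(1 - p_i) - \sum_i p'_i(1 - p'_i)\right|
\le \left|\sum_{i \in \mathcal{L}} p_i(1 - p_i) - \sum_{i \in \mathcal{L}} p'_i(1 - p'_i)\right| + \left|\sum_{i \in \mathcal{H}} p_i(1 - p_i) - \sum_{i \in \mathcal{H}} p'_i(1 - p'_i)\right|.$$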



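We proceed to bound the two terms of the above summation separately. Since the argument is symmetric
for $\mathcal{L}$ and $\mathcal{H}$ we only do $\mathcal{L}$. We have

$$\left|\sum_{i \in \mathcal{L}} p_i(1 - p_i) - \sum_{i \in \mathcal{L}} p'_i(1 - p'_i)\right| = \left|\sum_{i \in \mathcal{L}} (p_i - p'_i)(1 - (p_i + p'_i))\right|$$
$$= \left|\sum_{i \in \mathcal{L}} (p_i - p'_i) - \sum_{i \in \mathcal{L}} (p_i - p'_i)(p_i + p'_i)\right|$$
$$\le \left|\sum_{i \in \mathcal{L}} (p_i - p'_i)\right| + \left|\sum_{i \in \mathcal{L}} (p_i - p'_i)(p_i + p'_i)\right|$$
$$\le \frac{1}{k} + \sum_{i \in \mathcal{L}} |p_i - p'_i|(p_i + p'_i)$$
$$\le \frac{1}{k} + \frac{1}{k} \sum_{i \in \mathcal{L}} (p_i + p'_i)$$
$$\le \frac{1}{k} + \frac{1}{k} \left(2 \sum_{i \in \mathcal{L}} p_i + 1/k\right)$$
$$= \frac{1}{k} + \frac{1}{k} \left(\frac{2}{1 - 1/k} \sum_{i \in \mathcal{L}} p_i (1 - 1/k) + 1/k\right)$$
$$\le \frac{1}{k} + \frac{1}{k} \left(\frac{2}{1 - 1/k} \sum_{i \in \mathcal{L}} p_i (1 - p_i) + 1/k\right)$$
$$\le \frac{1}{k} + \frac{1}{k^2} + \frac{2}{k - 1} \sum_{i \in \mathcal{L}} p_i (1 - p_i).$$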



Using the above (and a symmetric argument for index set $\mathcal{H}$) we obtain:

$$|\sigma^2 - \sigma_Z^2| \le \frac{2}{k} + \frac{2}{k^2} + \frac{2}{k - 1} \sigma^2 = O(\epsilon)(1 + \sigma^2). \tag{15}$$



• Proof for $(\mu_Y, \sigma_Y^2)$: After the Stage 1 filter is applied to the collection $\{X_i\}$, the resulting collection of
random variables $\{Z_i\}$ has expectations $p'_i \in \{0, 1\} \cup [1/k, 1 - 1/k]$, for all $i$. The Stage 2 filter has
different form depending on the cardinality of the set $\mathcal{M} = \{i \mid p'_i \in [1/k, 1 - 1/k]\}$. In particular, if
$|\mathcal{M}| > k^3$ the output of the Stage 2 filter is in heavy Binomial form, while if $|\mathcal{M}| \le k^3$ the output of
the Stage 2 filter is in sparse form. As we are only looking to provide a guarantee for the distributions in
heavy Binomial form, it suffices to only consider the former case next.

- $|\mathcal{M}| > k^3$: Let $\{Y_i\}$ be the collection produced by Stage 2 and let $Y = \sum_i Y_i$. Then Lemma 6.1
in [Das08] implies that

$$|\mu_Z - \mu_Y| = O(\epsilon) \quad \text{and} \quad |\sigma_Z^2 - \sigma_Y^2| = O(1).$$

Combining this with (14) and (15) gives

$$|\mu - \mu_Y| = O(\epsilon) \quad \text{and} \quad |\sigma^2 - \sigma_Y^2| = O(1 + \epsilon \cdot (1 + \sigma^2)).$$

This concludes the proof of Theorem 4. ■
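The bounds (14) and (15) depend only on two properties of the Stage 1 rounding: no expectation moves by more than $1/k$, and the sum over each of the two index sets moves by at most $1/k$. The sketch below checks both bounds numerically. It does not reproduce the rounding of [Das08]; `stage1_round` is a hypothetical stand-in that merely has these two properties, and all function names are illustrative.

```python
import random


def stage1_round(ps, k):
    """Hypothetical rounding (NOT the procedure of [Das08]).

    Expectations in (0, 1/k) go to {0, 1/k} and expectations in (1 - 1/k, 1)
    go to {1 - 1/k, 1}; within each group the sum changes by less than 1/k.
    """
    low = [i for i, p in enumerate(ps) if 0 < p < 1 / k]
    high = [i for i, p in enumerate(ps) if 1 - 1 / k < p < 1]
    out = list(ps)
    # Low group: set the first floor(k * sum) indices to 1/k, the rest to 0.
    count = int(k * sum(ps[i] for i in low))
    for rank, i in enumerate(low):
        out[i] = 1 / k if rank < count else 0.0
    # High group: the same idea applied to 1 - p.
    count = int(k * sum(1 - ps[i] for i in high))
    for rank, i in enumerate(high):
        out[i] = 1 - 1 / k if rank < count else 1.0
    return out


def mean_var(ps):
    """Mean and variance of a sum of independent Bernoulli(p_i) variables."""
    return sum(ps), sum(p * (1 - p) for p in ps)


random.seed(0)
k = 10
ps = [random.random() for _ in range(200)]
mu, var = mean_var(ps)
mu_z, var_z = mean_var(stage1_round(ps, k))
mean_gap = abs(mu - mu_z)
var_gap = abs(var - var_z)
var_bound = 2 / k + 2 / k**2 + 2 / (k - 1) * var  # right-hand side of (15)
```

Here `mean_gap` should be at most $2/k$ (one $1/k$ for each index set) and `var_gap` at most `var_bound`.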
4.2 Improved Version of Theorem 4

In our new improved version of the Cover Theorem, the $k$-heavy Binomial Form distributions in the cover are
actually Binomial distributions $\mathrm{Bin}(\bar\ell, \bar q)$ (rather than translated Binomial distributions as in the original version)
for some $\bar\ell \le n$ and some $\bar q$ which is of the form (integer)$/n$ (rather than $q$ of the form (integer)$/(kn)$ as in the
original version). This gives an improved bound on the cover size. For clarity we state in full the improved
version of Theorem 4 below:

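Theorem 6 (Cover for PBDs, stronger version) For all $\epsilon > 0$, there exists an $\epsilon$-cover $\mathcal{S}_\epsilon \subseteq \mathcal{S}$ of $\mathcal{S}$ such that

1. $|\mathcal{S}_\epsilon| \le n^2 + n \cdot (1/\epsilon)^{O(\log^2 1/\epsilon)}$; and

2. The set $\mathcal{S}_\epsilon$ can be constructed in time linear in its representation size, i.e. $O(n^2) + O(n) \cdot (1/\epsilon)^{O(\log^2 1/\epsilon)}$.

Moreover, if $\{Z_i\} \in \mathcal{S}_\epsilon$, then the collection $\{Z_i\}$ has one of the following forms, where $k = k(\epsilon) \le C/\epsilon$ is a
positive integer, for some absolute constant $C > 0$:

(i) (Sparse Form) There is a value $\ell \le k^3 = O(1/\epsilon^3)$ such that for all $i \le \ell$ we have
$E[Z_i] \in \{\frac{1}{k^2}, \frac{2}{k^2}, \ldots, \frac{k^2 - 1}{k^2}\}$, and for all $i > \ell$ we have $E[Z_i] \in \{0, 1\}$.

(ii) (Binomial Form) There is a value $\bar\ell \in \{0, 1, \ldots, n\}$ and a value $\bar q \in \{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n}{n}\}$ such that for all
$i \le \bar\ell$ we have $E[Z_i] = \bar q$; for all $i > \bar\ell$ we have $E[Z_i] = 0$; and $\bar\ell, \bar q$ satisfy the bounds
$\bar\ell \bar q \ge k^2 - 2 - \frac{1}{k}$ and $\bar\ell \bar q (1 - \bar q) \ge k^2 - k - 3 - \frac{3}{k}$.

Finally, for every $\{X_i\} \in \mathcal{S}$ for which there is no $\epsilon$-neighbor in $\mathcal{S}_\epsilon$ that is in sparse form, there exists a collection
$\{Z_i\} \in \mathcal{S}_\epsilon$ in Binomial form such that

(iii) $d_{TV}(\sum_i X_i, \sum_i Z_i) \le \epsilon$; and

(iv) if $\mu = E[\sum_i X_i]$, $\bar\mu = E[\sum_i Z_i]$, $\sigma^2 = \mathrm{Var}[\sum_i X_i]$ and $\bar\sigma^2 = \mathrm{Var}[\sum_i Z_i]$, then $|\mu - \bar\mu| = 2 + O(\epsilon)$ and
$|\sigma^2 - \bar\sigma^2| = O(1 + \epsilon \cdot (1 + \sigma^2))$.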

Proof: Suppose that $X = \{X_i\} \in \mathcal{S}$ is a PBD that is not $\epsilon_1$-close to any Sparse Form PBD in the cover $\mathcal{S}_{\epsilon_1}$ of
Theorem 4, where $\epsilon_1 = \Theta(\epsilon)$ is a suitable (small) constant multiple of $\epsilon$ (more on this below). Let $\mu, \sigma^2$ denote
the mean and variance of $\sum_i X_i$. Parts (iii) and (iv) of Theorem 4 imply that there is a collection $\{Y_i\} \in \mathcal{S}_{\epsilon_1}$
in $k$-heavy Binomial Form that is close to $\sum_i X_i$ both in variation distance and in its mean $\mu'$ and variance $\sigma'^2$.
More precisely, let $\ell, q$ be the parameters defining $\{Y_i\}$ as in part (ii) of Theorem 4 and let $\mu', \sigma'^2$ be the mean
and variance of $\sum_i Y_i$; so we have $\mu' = \ell q + t$ for some integer $0 \le t \le n - \ell$ and $\sigma'^2 = \ell q (1 - q) \ge \Omega(1/\epsilon_1^2)$
from part (ii). This implies that the bounds $|\mu - \mu'| = O(\epsilon_1)$ and $|\sigma^2 - \sigma'^2| = O(1 + \epsilon_1 \cdot (1 + \sigma^2))$ of (iv)
are at least as strong as the bounds given by Equation ^ (here we have used the fact that $\epsilon_1$ is a suitably small
constant multiple of $\epsilon$), so we may use the analysis of Section 2.2. The analysis of Section 2.2 (Claim 1 and
LemmalSl) gives that $d_{TV}(X, TP(\mu', \sigma'^2)) \le O(\epsilon_1)$.

Now the analysis of Locate-Binomial (from Section 2.4) implies that $TP(\mu', \sigma'^2)$ is $O(\epsilon_1)$-close to a
Binomial distribution $\mathrm{Bin}(\hat n, \hat p)$. We first observe that in Step 2.a of Section 2.4, the variance $\sigma'^2 = \ell q (1 - q)$
is at most $n/4$ and so the $\sigma_1^2$ that is defined in Step 2.a equals $\sigma'^2$. We next observe that by the Cauchy-Schwarz
inequality we have $\mu'^2 \le n(\mu' - \sigma'^2)$, and thus the value $\sigma_2^2$ defined in Step 2.b of Section 2.4 also equals $\sigma'^2$.
Thus we have that the distribution $\mathrm{Bin}(\hat n, \hat p)$ resulting from Locate-Binomial is defined by

$$\hat n = \left\lfloor \frac{(\ell q + t)^2}{\ell q^2 + t} \right\rfloor \quad \text{and} \quad \hat p = \frac{\ell q^2 + t}{\ell q + t}.$$



So we have established that $X$ is $O(\epsilon_1)$-close to the Binomial distribution $\mathrm{Bin}(\hat n, \hat p)$. We first establish
that the parameters $\hat n, \hat p$ and the corresponding mean and variance $\hat\mu = \hat n \hat p$, $\hat\sigma^2 = \hat n \hat p (1 - \hat p)$ satisfy the bounds
claimed in parts (ii) and (iv) of Theorem 6. To finally prove the theorem we will take $\bar\ell = \hat n$ and $\bar q$ to be $\hat p$ rounded
to the nearest integer multiple of $1/n$, and we will show that the Binomial distribution $\mathrm{Bin}(\bar\ell, \bar q)$ satisfies all the
claimed bounds.

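If $t = 0$ then it is easy to see that $\hat n = \ell$ and $\hat p = q$, and all the claimed bounds in parts (ii) and (iv) of
Theorem 6 hold as desired for $\hat n$, $\hat p$, $\hat\mu$ and $\hat\sigma^2$. Otherwise $t \ge 1$ and we have

$$\hat n \hat p = \left\lfloor \frac{(\ell q + t)^2}{\ell q^2 + t} \right\rfloor \cdot \frac{\ell q^2 + t}{\ell q + t} \ge \left( \frac{(\ell q + t)^2}{\ell q^2 + t} - 1 \right) \cdot \frac{\ell q^2 + t}{\ell q + t} = \mu' - \hat p \ge \mu' - 1,$$

and similarly

$$\hat n \hat p (1 - \hat p) \ge \left( \frac{(\ell q + t)^2}{\ell q^2 + t} - 1 \right) \cdot \frac{\ell q^2 + t}{\ell q + t} \cdot \frac{\ell q - \ell q^2}{\ell q + t} = \sigma'^2 - \hat p (1 - \hat p) \ge \sigma'^2 - 1,$$

so we have the bounds claimed in (ii). Similarly, we have

$$\mu' = \ell q + t = \frac{(\ell q + t)^2}{\ell q^2 + t} \cdot \frac{\ell q^2 + t}{\ell q + t} \ge \left\lfloor \frac{(\ell q + t)^2}{\ell q^2 + t} \right\rfloor \cdot \frac{\ell q^2 + t}{\ell q + t} = \hat n \hat p = \hat\mu \ge \mu' - 1,$$

so from part (iv) of Theorem 4 we get the desired bound $|\mu - \hat\mu| \le 1 + O(\epsilon)$ of Theorem 6. Recalling that
$\sigma'^2 = \ell q (1 - q)$, we have shown above that $\hat\sigma^2 \ge \sigma'^2 - 1$; we now observe that

$$\sigma'^2 = \frac{(\ell q + t)^2}{\ell q^2 + t} \cdot \frac{\ell q^2 + t}{\ell q + t} \cdot \frac{\ell q - \ell q^2}{\ell q + t} \ge \hat n \hat p (1 - \hat p) = \hat\sigma^2,$$

so from part (iv) of Theorem 4 we get the desired bound $|\sigma^2 - \hat\sigma^2| \le O(1 + \epsilon(1 + \sigma^2))$ of Theorem 6.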
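The parameter bounds above can be checked with exact rational arithmetic. The sketch below assumes the parameter choice $\hat n = \lfloor \mu'^2 / (\mu' - \sigma'^2) \rfloor$ and $\hat p = (\mu' - \sigma'^2)/\mu'$, with $\mu' = \ell q + t$ and $\sigma'^2 = \ell q (1 - q)$; the function name and the sample values are illustrative only.

```python
from fractions import Fraction
from math import floor


def locate_binomial_params(l, q, t):
    """Return (n_hat, p_hat, mu_prime, var_prime) for a translated Bin(l, q) + t."""
    q = Fraction(q)
    mu = l * q + t            # mu' = l*q + t
    var = l * q * (1 - q)     # sigma'^2 = l*q*(1 - q)
    gap = mu - var            # equals l*q^2 + t
    n_hat = floor(mu * mu / gap)
    p_hat = gap / mu
    return n_hat, p_hat, mu, var


n_hat, p_hat, mu, var = locate_binomial_params(100, Fraction(3, 10), 5)
mu_hat = n_hat * p_hat
var_hat = n_hat * p_hat * (1 - p_hat)
```

For these values $\mu' = 35$ and $\sigma'^2 = 21$, and the fitted Binomial has mean and variance within 1 below those numbers; with $t = 0$ the function returns the original $(\ell, q)$.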

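Finally, we take $\bar\ell = \hat n$ and $\bar q$ to be $\hat p$ rounded to the nearest multiple of $1/n$ as described above; $Z =
\mathrm{Bin}(\bar\ell, \bar q)$ is the desired Binomial distribution whose existence is claimed by the theorem, and the parameters
$\bar\mu, \bar\sigma^2$ of the theorem are $\bar\mu = \bar\ell \bar q$, $\bar\sigma^2 = \bar\ell \bar q (1 - \bar q)$. Passing from $\mathrm{Bin}(\hat n, \hat p)$ to $\mathrm{Bin}(\bar\ell, \bar q)$ changes the mean and
variance of the Binomial distribution by at most 1, so all the claimed bounds from parts (ii) and (iv) of Theorem 6
indeed hold. To finish the proof of the theorem it remains only to show that $d_{TV}(\mathrm{Bin}(\bar\ell, \hat p), \mathrm{Bin}(\bar\ell, \bar q)) \le O(\epsilon)$.
Similar to Section 2.2 this is done by passing through Translated Poisson distributions. We show that

$d_{TV}(\mathrm{Bin}(\bar\ell, \hat p), TP(\bar\ell \hat p, \bar\ell \hat p (1 - \hat p)))$, $d_{TV}(TP(\bar\ell \hat p, \bar\ell \hat p (1 - \hat p)), TP(\bar\ell \bar q, \bar\ell \bar q (1 - \bar q)))$, and
$d_{TV}(TP(\bar\ell \bar q, \bar\ell \bar q (1 - \bar q)), \mathrm{Bin}(\bar\ell, \bar q))$
are each at most $O(\epsilon)$, and invoke the triangle inequality.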

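1. Bounding $d_{TV}(\mathrm{Bin}(\bar\ell, \hat p), TP(\bar\ell \hat p, \bar\ell \hat p (1 - \hat p)))$: Using Lemma 1 we get

$$d_{TV}(\mathrm{Bin}(\bar\ell, \hat p), TP(\bar\ell \hat p, \bar\ell \hat p (1 - \hat p))) \le \frac{1}{\sqrt{\bar\ell \hat p (1 - \hat p)}} + \frac{2}{\bar\ell \hat p (1 - \hat p)}.$$

Since $\bar\ell \hat p = \hat n \hat p \ge k^2 - 1/k = \Omega(1/\epsilon^2)$ we have that the RHS above is at most $O(\epsilon)$.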

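2. Bounding $d_{TV}(TP(\bar\ell \hat p, \bar\ell \hat p (1 - \hat p)), TP(\bar\ell \bar q, \bar\ell \bar q (1 - \bar q)))$: Let $\tilde\sigma^2$ denote $\min\{\bar\ell \hat p (1 - \hat p), \bar\ell \bar q (1 - \bar q)\}$. Since
$|\bar q - \hat p| \le 1/n$, we have that $\bar\ell \bar q (1 - \bar q) = \bar\ell \hat p (1 - \hat p) \pm O(1) = \Omega(1/\epsilon^2)$, so $\tilde\sigma = \Omega(1/\epsilon)$. We use LemmaH,
which tells us that

$$d_{TV}(TP(\bar\ell \hat p, \bar\ell \hat p (1 - \hat p)), TP(\bar\ell \bar q, \bar\ell \bar q (1 - \bar q))) \le \frac{|\bar\ell \hat p - \bar\ell \bar q|}{\tilde\sigma} + \frac{|\bar\ell \hat p (1 - \hat p) - \bar\ell \bar q (1 - \bar q)| + 1}{\tilde\sigma^2}. \tag{16}$$

Since $|\hat p - \bar q| \le 1/n$, we have that $|\bar\ell \hat p - \bar\ell \bar q| = \bar\ell |\hat p - \bar q| \le \bar\ell / n \le 1$, so the first fraction on the RHS
of (16) is $O(\epsilon)$. The second fraction is at most $(O(1) + 1)/\tilde\sigma^2 = O(\epsilon^2)$, so we get
$d_{TV}(TP(\bar\ell \hat p, \bar\ell \hat p (1 - \hat p)), TP(\bar\ell \bar q, \bar\ell \bar q (1 - \bar q))) \le O(\epsilon)$ as desired.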

3. Bounding $d_{TV}(TP(\bar\ell \bar q, \bar\ell \bar q (1 - \bar q)), \mathrm{Bin}(\bar\ell, \bar q))$: We use Lemma 1 similar to the first case above, together
with the lower bound $\tilde\sigma = \Omega(1/\epsilon)$, to get the desired $O(\epsilon)$ upper bound.

This concludes the proof of Theorem 6. ■
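As a numerical sanity check of the three-step triangle-inequality argument, the sketch below compares the exact total variation distance between two Binomials whose success probabilities differ by at most $1/n$ with the sum of the three bounds. It assumes the Binomial-to-Translated-Poisson bound has the form $1/\sqrt{v} + 2/v$ for variance $v$, and uses the right-hand side of (16) for the middle step; all function names and the sample parameters are illustrative.

```python
import math


def binom_pmf(n, p):
    """PMF of Bin(n, p) for 0 < p < 1, computed in log-space."""
    log_p, log_q = math.log(p), math.log1p(-p)
    return [
        math.exp(math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                 + j * log_p + (n - j) * log_q)
        for j in range(n + 1)
    ]


def tv_distance(a, b):
    """Total variation distance between two PMFs on the same support."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def binomial_vs_tp_bound(var):
    """Assumed form of the Binomial vs. Translated Poisson bound."""
    return 1 / math.sqrt(var) + 2 / var


def tp_vs_tp_bound(mu1, var1, mu2, var2):
    """Right-hand side of (16), using the smaller of the two variances."""
    s2 = min(var1, var2)
    return abs(mu1 - mu2) / math.sqrt(s2) + (abs(var1 - var2) + 1) / s2


def rounding_check(l, p_hat, n):
    """Round p_hat to a multiple of 1/n; return (q_bar, exact TV, triangle bound)."""
    q_bar = round(p_hat * n) / n
    v_p = l * p_hat * (1 - p_hat)
    v_q = l * q_bar * (1 - q_bar)
    bound = (binomial_vs_tp_bound(v_p)
             + tp_vs_tp_bound(l * p_hat, v_p, l * q_bar, v_q)
             + binomial_vs_tp_bound(v_q))
    exact = tv_distance(binom_pmf(l, p_hat), binom_pmf(l, q_bar))
    return q_bar, exact, bound


q_bar, exact_tv, triangle_bound = rounding_check(400, 0.30037, 1000)
```

The exact distance is typically far below the triangle bound, which is an upper bound and makes no claim to be tight.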
5 Birgé's theorem: Learning unimodal distributions

Here we briefly explain how Theorem 5 follows from [Bir97]. We assume that the reader is moderately familiar
with the paper [Bir97].

Birgé (see his Theorem 1 and Corollary 1) upper bounds the expected variation distance between the target
distribution (which he denotes $f$) and the hypothesis distribution that is constructed by his algorithm (which
he denotes $f_n$; it should be noted, though, that his "$n$" parameter denotes the number of samples used by the
algorithm, while we will denote this by "$m$", reserving "$n$" for the domain $\{1, \ldots, n\}$ of the distribution).
More precisely, [Bir97] shows that this expected variation distance is at most that of the Grenander estimator
(applied to learn a unimodal distribution when the mode is known) plus a lower-order term. For our Theorem 5
we take Birgé's "$\eta$" parameter to be $\epsilon$. With this choice of $\eta$, by the results of [Bir87a, Bir87b] bounding the
expected error of the Grenander estimator, if $m = O(\log(n)/\epsilon^3)$ samples are used in Birgé's algorithm then
the expected variation distance between the target distribution and his hypothesis distribution is at most $O(\epsilon)$.
To go from expected error $\epsilon$ to an $\epsilon$-accurate hypothesis with probability $1 - \delta$, we run the above-described
algorithm $O(\log(1/\delta))$ times so that with probability at least $1 - \delta$ some hypothesis obtained is $\epsilon$-accurate. Then
we use our hypothesis testing procedure of Lemma 8, or, more precisely, the extension provided in Lemma 11,
to identify an $O(\epsilon)$-accurate hypothesis. (The use of Lemma 11 is why the running time of Theorem 5 depends
quadratically on $\log(1/\delta)$.)

It remains only to argue that a single run of Birgé's algorithm on a sample of size $m = O(\log(n)/\epsilon^3)$ can be
carried out in $\tilde O(\log^2(n)/\epsilon^3)$ bit operations (recall that each sample is a $\log(n)$-bit string). His algorithm begins
by locating an $r \in [n]$ that approximately minimizes the value of his function $d(r)$ (see Section 3 of [Bir97]) to
within an additive $\eta = \epsilon$ (see Definition 3 of his paper); intuitively this $r$ represents his algorithm's "guess" at
the true mode of the distribution. To locate such an $r$, following Birgé's suggestion in Section 3 of his paper, we
begin by identifying two consecutive points in the sample such that $r$ lies between those two sample points. This
can be done using $\log m$ stages of binary search over the (sorted) points in the sample, where at each stage of the
binary search we compute the two functions $d^-$ and $d^+$ and proceed in the appropriate direction. To compute
the function $d^-(j)$ at a given point $j$ (the computation of $d^+$ is analogous), we recall that $d^-(j)$ is defined as
the maximum difference over $[1, j]$ between the empirical cdf and its convex minorant over $[1, j]$. The convex
minorant of the empirical cdf (over $m$ points) can be computed in $\tilde O((\log n) m)$ bit-operations (where the $\log n$
comes from the fact that each sample point is an element of $[n]$), and then by enumerating over all points in the
sample that lie in $[1, j]$ (in time $\tilde O((\log n) m)$) we can compute $d^-(j)$. Thus it is possible to identify two adjacent
points in the sample such that $r$ lies between them in time $\tilde O((\log n) m)$. Finally, as Birgé explains in the last
paragraph of Section 3 of his paper, once two such points have been identified it is possible to again use binary
search to find a point $r$ in that interval where $d(r)$ is minimized to within an additive $\eta$. Since the maximum
difference between $d^-$ and $d^+$ can never exceed 1, at most $\log(1/\eta) = \log(1/\epsilon)$ stages of binary search are
required here to find the desired $r$.
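The following sketch illustrates the quantity $d^-$ used in the binary search: the largest gap between the empirical cdf and its convex minorant. It is a simplification, not Birgé's procedure: the cdf is evaluated only at the distinct sample points, the minorant is taken as the lower convex hull of those points (monotone-chain scan, linear time once the sample is sorted), and at least two distinct sample values are assumed. All function names are illustrative.

```python
def empirical_cdf_points(sample):
    """Sorted distinct sample values paired with the empirical cdf at each value."""
    ordered = sorted(sample)
    m = len(ordered)
    points = []
    for idx, x in enumerate(ordered):
        if idx + 1 == m or ordered[idx + 1] != x:
            points.append((x, (idx + 1) / m))
    return points


def lower_convex_hull(points):
    """Vertices of the greatest convex minorant of points sorted by x."""
    hull = []
    for x, y in points:
        # Drop the last vertex while it lies on or above the chord to (x, y).
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append((x, y))
    return hull


def d_minus(points):
    """Maximum of (cdf value - convex minorant) over the given points."""
    hull = lower_convex_hull(points)
    best, seg = 0.0, 0
    for x, y in points:
        # Advance to the hull segment whose x-range contains x.
        while seg + 1 < len(hull) - 1 and hull[seg + 1][0] <= x:
            seg += 1
        (x1, y1), (x2, y2) = hull[seg], hull[seg + 1]
        minorant = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
        best = max(best, y - minorant)
    return best
```

A cdf that is already convex on the interval gives `d_minus` equal to 0, while a concave bump gives a positive value.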

Finally, once the desired $r$ has been obtained, it is straightforward to output the final hypothesis (which Birgé
denotes $f_n$). As explained in Definition 3, this hypothesis is the derivative of F^*^, which is essentially the convex
minorant of the empirical cdf to the left of $r$ and the convex majorant of the empirical cdf to the right of $r$. As
described above, given a value of $r$ these convex majorants and minorants can be computed in $\tilde O((\log n) m)$
time, and the derivative is simply a collection of uniform distributions as claimed. This concludes our sketch of
how Theorem 5 follows from [Bir97].

6 Efficient Evaluation of the Poisson Distribution 

In this section we provide an efficient algorithm to compute an additive approximation to the Poisson probability 
mass function. This seems like a basic operation in numerical analysis, but we were not able to find it explicitly 
in the literature. 

Before we state our theorem we need some notation. For a positive integer $n$, denote by $|n|$ its description
complexity (bit complexity), i.e. $|n| = \lceil \log_2 n \rceil$. We represent a positive rational number $q$ as $\frac{q_1}{q_2}$, where $q_1, q_2$
are relatively prime positive integers. The description complexity of $q$ is defined to be $|q| = |q_1| + |q_2|$. We are
now ready to state our theorem for this section:

Theorem 7 There is an algorithm that, on input a rational number $\lambda > 0$, and integers $k \ge 0$ and $t > 0$,
produces an estimate $\hat p_k$ such that

$$|\hat p_k - p_k| \le \frac{1}{t},$$

where $p_k = \frac{\lambda^k e^{-\lambda}}{k!}$ is the probability that the Poisson distribution of parameter $\lambda$ assigns to integer $k$. The
running time of the algorithm is $\tilde O(|t|^3 + |k| \cdot |t| + |\lambda| \cdot |t|)$.

Proof: Clearly we cannot just compute $e^{-\lambda}$, $\lambda^k$ and $k!$ separately, as this will take time exponential in the
description complexity of $k$ and $\lambda$. We follow instead an indirect approach. We start by rewriting the target
probability as follows

$$p_k = e^{-\lambda + k \ln(\lambda) - \ln(k!)}.$$

Motivated by this formula, let

$$E_k := -\lambda + k \ln(\lambda) - \ln(k!).$$

Note that $E_k \le 0$. Our goal is to approximate $E_k$ to within high enough accuracy and then use this approximation
to approximate $p_k$.
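The benefit of working with the exponent $E_k$ can be seen in ordinary floating point. The sketch below is not the bit-complexity algorithm of Theorem 7: it uses the standard-library `math.lgamma` for $\ln(k!)$ and double precision throughout, and the function names are illustrative. It only shows that the exponent form stays finite where the separate factors $\lambda^k$ and $k!$ would overflow.

```python
import math


def poisson_exponent(k, lam):
    """E_k = -lam + k*ln(lam) - ln(k!), with ln(k!) taken from lgamma(k + 1)."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)


def poisson_pmf(k, lam):
    """Poisson probability p_k = exp(E_k)."""
    return math.exp(poisson_exponent(k, lam))


small = poisson_pmf(3, 2.0)            # same as exp(-2) * 2**3 / 3!
large = poisson_pmf(10**6, 1.0e6)      # lam**k and k! alone would overflow a float
```

For $k = \lambda = 10^6$ the result is close to $1/\sqrt{2\pi\lambda}$, as the normal approximation to the Poisson distribution suggests.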

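In particular, the main part of the argument involves an efficient algorithm to compute an approximation $\hat{\hat E}_k$
to $E_k$ satisfying

$$|\hat{\hat E}_k - E_k| \le \frac{1}{4t} \le \frac{1}{2t} - \frac{1}{8t^2}. \tag{17}$$

This approximation has bit complexity $O(|k| + |\lambda| + |t|)$ and can be computed in time $\tilde O(|k| \cdot |t| + |\lambda| + |t|^3)$.
We first show how to use such an approximation to complete the proof. We claim that it suffices to approximate
$e^{\hat{\hat E}_k}$ to within an additive error $\frac{1}{2t}$. Indeed, if $\hat p_k$ is the result of this approximation, then:

$$\hat p_k \le e^{\hat{\hat E}_k} + \frac{1}{2t} \le e^{E_k + \frac{1}{2t} - \frac{1}{8t^2}} + \frac{1}{2t} \le e^{E_k + \ln(1 + \frac{1}{2t})} + \frac{1}{2t} = e^{E_k} \left(1 + \frac{1}{2t}\right) + \frac{1}{2t} \le p_k + \frac{1}{t},$$

and similarly

$$\hat p_k \ge e^{\hat{\hat E}_k} - \frac{1}{2t} \ge e^{E_k - (\frac{1}{2t} - \frac{1}{8t^2})} - \frac{1}{2t} \ge e^{E_k} \Big/ \left(1 + \frac{1}{2t}\right) - \frac{1}{2t} \ge e^{E_k} \left(1 - \frac{1}{2t}\right) - \frac{1}{2t} \ge p_k - \frac{1}{t}.$$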

We will need the following lemma: 

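Lemma 14 Let $\alpha \le 0$ be a rational number. There is an algorithm that computes an estimate $\widehat{e^\alpha}$ such that

$$\left|\widehat{e^\alpha} - e^\alpha\right| \le \frac{1}{2t}$$

and has running time $\tilde O(|\alpha| \cdot |t| + |t|^2)$.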
Proof: Since $e^\alpha \in [0, 1]$, the point of the additive grid $\{\frac{i}{4t}\}_{i=1}^{4t}$ closest to $e^\alpha$ achieves error at most $1/(4t)$.
Equivalently, in a logarithmic scale, consider the grid $\{\ln \frac{i}{4t}\}_{i=1}^{4t}$ and let $j^* := \arg\min_j \{|\alpha - \ln(\frac{j}{4t})|\}$. Then,
we have that

$$\left|\frac{j^*}{4t} - e^\alpha\right| \le \frac{1}{4t}.$$

The idea of the algorithm is to approximately identify the point $j^*$, by computing approximations to the points of
the logarithmic grid combined with a binary search procedure. Indeed, consider the "rounded" grid $\{\widehat{\ln \frac{i}{4t}}\}_{i=1}^{4t}$
where each $\widehat{\ln(\frac{i}{4t})}$ is an approximation to $\ln(\frac{i}{4t})$ that is accurate to within an additive $\frac{1}{16t}$. Notice that, for
$i = 1, \ldots, 4t$:

$$\ln\left(\frac{i+1}{4t}\right) - \ln\left(\frac{i}{4t}\right) = \ln\left(1 + \frac{1}{i}\right) \ge \ln\left(1 + \frac{1}{4t}\right) > \frac{1}{8t}.$$

Given that our approximations are accurate to within an additive $1/(16t)$, it follows that the rounded grid
$\{\widehat{\ln \frac{i}{4t}}\}_{i=1}^{4t}$ is monotonic in $i$.

The algorithm does not construct the points of this grid explicitly, but adaptively as it needs them. In
particular, it performs a binary search in the set $\{1, \ldots, 4t\}$ to find the point $i^* := \arg\min_i \{|\alpha - \widehat{\ln(\frac{i}{4t})}|\}$. In
every iteration of the search, when the algorithm examines the point $j$, it needs to compute the approximation
$g_j = \widehat{\ln(\frac{j}{4t})}$ and evaluate the distance $|\alpha - g_j|$. It is known that the logarithm of a number $x$ with a binary fraction
of $L$ bits and an exponent of $o(L)$ bits can be computed to within a relative error $O(2^{-L})$ in time $\tilde O(L)$ [Bre75].
It follows from this that $g_j$ has $O(|t|)$ bits and can be computed in time $\tilde O(|t|)$. The subtraction takes linear
time, i.e. it uses $O(|\alpha| + |t|)$ bit operations. Therefore, each step of the binary search can be done in time
$O(|\alpha|) + \tilde O(|t|)$ and thus the overall algorithm has $O(|\alpha| \cdot |t|) + \tilde O(|t|^2)$ running time.

The algorithm outputs $\frac{i^*}{4t}$ as its final approximation to $e^\alpha$. We argue next that the achieved error is at most
an additive $\frac{1}{2t}$. Since the distance between two consecutive points of the grid $\{\ln \frac{i}{4t}\}_{i=1}^{4t}$ is more than $1/(8t)$ and
our approximations are accurate to within an additive $1/(16t)$, a little thought reveals that $i^* \in \{j^* - 1, j^*, j^* + 1\}$.
This implies that $\frac{i^*}{4t}$ is within an additive $1/(2t)$ of $e^\alpha$ as desired, and the proof of the lemma is complete. ■
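The grid search in the proof of Lemma 14 can be sketched as follows. This is only an illustration under simplifying assumptions: the approximate logarithms $\widehat{\ln(i/4t)}$ are replaced by double-precision `math.log` rather than the multiple-precision routine of [Bre75], so the sketch says nothing about bit complexity, and the function name is illustrative.

```python
import math


def approx_exp(alpha, t):
    """Approximate e^alpha (alpha <= 0) by a point i/(4t) of the additive grid.

    Binary search over i in {1, ..., 4t} on the logarithmic grid ln(i / (4t)).
    """
    assert alpha <= 0 and t >= 1
    grid = 4 * t
    lo, hi = 1, grid
    # Find the smallest i with ln(i / grid) >= alpha; i = grid always qualifies.
    while lo < hi:
        mid = (lo + hi) // 2
        if math.log(mid / grid) >= alpha:
            hi = mid
        else:
            lo = mid + 1
    best = lo
    # The neighbour just below may be closer to alpha in the logarithmic scale.
    if lo > 1 and abs(alpha - math.log((lo - 1) / grid)) < abs(alpha - math.log(lo / grid)):
        best = lo - 1
    return best / grid
```

The search takes about $\log_2(4t)$ iterations, and the returned grid point is within $1/(2t)$ of $e^\alpha$ because consecutive grid points are $1/(4t)$ apart.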

We now proceed to describe how to approximate $e^{\hat{\hat E}_k}$. Recall that we want to output an estimate $\hat p_k$ such
that $|\hat p_k - e^{\hat{\hat E}_k}| \le 1/(2t)$. We distinguish the following cases:

• If $\hat{\hat E}_k \ge 0$, we output $\hat p_k := 1$. Indeed, given that $|\hat{\hat E}_k - E_k| \le \frac{1}{4t}$ and $E_k \le 0$, if $\hat{\hat E}_k \ge 0$ then $\hat{\hat E}_k \in [0, \frac{1}{4t}]$.
Hence, because $t \ge 1$, $e^{\hat{\hat E}_k} \in [1, 1 + 1/(2t)]$, so 1 is within an additive $1/(2t)$ of the right answer.

• Otherwise, $\hat p_k$ is defined to be the estimate obtained by applying Lemma 14 for $\alpha := \hat{\hat E}_k$. Given the bit
complexity of $\hat{\hat E}_k$, the running time of this procedure will be $\tilde O(|k| \cdot |t| + |\lambda| \cdot |t| + |t|^2)$.

Hence, the overall running time is $\tilde O(|k| \cdot |t| + |\lambda| \cdot |t| + |t|^3)$.
We now show how to compute $\hat{\hat E}_k$. There are several steps to our approximation:
1. (Stirling's Asymptotic Approximation): Recall Stirling's asymptotic approximation (see e.g. [Whi80]):

$$\ln k! = k \ln(k) - k + \frac{1}{2} \ln(2\pi k) + \sum_{j=2}^{m} \frac{B_j \cdot (-1)^j}{j(j-1) \cdot k^{j-1}} + O(1/k^m),$$

where $B_j$ are the Bernoulli numbers. We define an approximation of $\ln k!$ as follows:

$$\widehat{\ln k!} := k \ln(k) - k + \frac{1}{2} \ln(2\pi k) + \sum_{j=2}^{m_0} \frac{B_j \cdot (-1)^j}{j(j-1) \cdot k^{j-1}}$$

for $m_0 := \lceil \frac{\log t}{\log k} \rceil + 1$.

2. (Definition of an approximate exponent $\hat E_k$): Define $\hat E_k := -\lambda + k \ln(\lambda) - \widehat{\ln(k!)}$. Given the above
discussion, we can calculate the distance of $\hat E_k$ to the true exponent $E_k$ as follows:

$$|\hat E_k - E_k| \le |\widehat{\ln(k!)} - \ln(k!)| \le O(1/k^{m_0}) \tag{18}$$
$$\le \frac{1}{10t}. \tag{19}$$

So we can focus our attention to approximating $\hat E_k$. Note that $\hat E_k$ is the sum of $m_0 + 2 = O(\frac{\log t}{\log k})$ terms.
To approximate it within error $1/(10t)$, it suffices to approximate each summand within an additive error
of $O(1/(t \cdot \log t))$. Indeed, we so approximate each summand and our final approximation $\hat{\hat E}_k$ will be the
sum of these approximations. We proceed with the analysis:

3. (Estimating $2\pi$): Since $2\pi$ shows up in the above expression, we should try to approximate it. It is known
that the first $\ell$ digits of $\pi$ can be computed exactly in time $O(\log \ell \cdot M(\ell))$, where $M(\ell)$ is the time to
multiply two $\ell$-bit integers [Sal76, Bre76]. For example, if we use the Schönhage-Strassen algorithm for
multiplication [SS71], we get $M(\ell) = O(\ell \cdot \log \ell \cdot \log\log \ell)$. Hence, choosing $\ell := \lceil \log_2(12t \cdot \log t) \rceil$,
we can obtain in time $\tilde O(|t|)$ an approximation $\widehat{2\pi}$ of $2\pi$ that has a binary fraction of $\ell$ bits and satisfies:

$$|\widehat{2\pi} - 2\pi| \le 2^{-\ell} \implies (1 - 2^{-\ell}) 2\pi \le \widehat{2\pi} \le (1 + 2^{-\ell}) 2\pi.$$

Note that, with this approximation, we have

$$|\ln(2\pi) - \ln(\widehat{2\pi})| \le \left|\ln\left(1 - \frac{2^{-\ell}}{2\pi}\right)\right| \le 2^{-\ell} \le 1/(12t \cdot \log t).$$

4. (Floating-Point Representation): We will also need accurate approximations to $\ln \widehat{2\pi}$, $\ln k$ and $\ln \lambda$. We
think of $\widehat{2\pi}$ and $k$ as multiple-precision floating point numbers base 2. In particular,

• $\widehat{2\pi}$ can be described with a binary fraction of $\ell + 3$ bits and a constant size exponent; and

• $k = 2^{\lceil \log k \rceil} \cdot \frac{k}{2^{\lceil \log k \rceil}}$ can be described with a binary fraction of $\lceil \log k \rceil$, i.e. $|k|$, bits and an exponent
of length $O(\log\log k)$, i.e. $O(\log |k|)$.

Also, since $\lambda$ is a positive rational number, $\lambda = \frac{\lambda_1}{\lambda_2}$, where $\lambda_1$ and $\lambda_2$ are positive integers of at most
$|\lambda|$ bits. Hence, for $i = 1, 2$, we can think of $\lambda_i$ as a multiple-precision floating point number base
2 with a binary fraction of $|\lambda|$ bits and an exponent of length $O(\log |\lambda|)$. Hence, if we choose $L =
\lceil \log_2(12(3k + 1) t^2 \cdot k \cdot \lambda_1 \cdot \lambda_2) \rceil = O(|k| + |\lambda| + |t|)$, we can represent all numbers $\widehat{2\pi}, \lambda_1, \lambda_2, k$ as
multiple precision floating point numbers with a binary fraction of $L$ bits and an exponent of $O(\log L)$
bits.

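5. (Estimating the logs): It is known that the logarithm of a number $x$ with a binary fraction of $L$ bits and an
exponent of $o(L)$ bits can be computed to within a relative error $O(2^{-L})$ in time $\tilde O(L)$ [Bre75]. Hence,
in time $\tilde O(L)$ we can obtain approximations $\widehat{\ln \widehat{2\pi}}$, $\widehat{\ln k}$, $\widehat{\ln \lambda_1}$, $\widehat{\ln \lambda_2}$ such that:

• $|\widehat{\ln k} - \ln k| \le 2^{-L} \ln k \le \frac{1}{12(3k+1)t^2}$; and similarly

• $|\widehat{\ln \lambda_i} - \ln \lambda_i| \le \frac{1}{12(3k+1)t^2}$, for $i = 1, 2$;

• $|\widehat{\ln \widehat{2\pi}} - \ln \widehat{2\pi}| \le \frac{1}{12(3k+1)t^2}$.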
6. (Estimating the terms of the series): To complete the analysis, we also need to approximate each term of
the form $c_j = \frac{B_j}{j(j-1)} \cdot \frac{1}{k^{j-1}}$ up to an additive error of $O(1/(t \cdot \log t))$. We do this as follows: We compute
the numbers $B_j$ and $k^{j-1}$ exactly, and we perform the division approximately.

Clearly, the positive integer $k^{j-1}$ has description complexity $j \cdot |k| = O(m_0 \cdot |k|) = O(|t| + |k|)$, since
$j = O(m_0)$. We compute $k^{j-1}$ exactly using repeated squaring in time $\tilde O(j \cdot |k|) = \tilde O(|t| + |k|)$. It is
known [Fil92] that the rational number $B_j$ has $\tilde O(j)$ bits and can be computed in $\tilde O(j^2) = \tilde O(|t|^2)$ time.
Hence, the approximate evaluation of the term $c_j$ (up to the desired additive error of $1/(t \log t)$) can be
done in $\tilde O(|t|^2 + |k|)$, by a rational division operation (see e.g. [Knu81]). The sum of all the approximate
terms takes linear time, hence the approximate evaluation of the entire truncated series (comprising at
most $m_0 \le |t|$ terms) can be done in $\tilde O(|t|^3 + |k| \cdot |t|)$ time overall.

Let $\hat{\hat E}_k$ be the approximation arising if we use all the aforementioned approximations. It follows from the
above computations that

$$|\hat{\hat E}_k - \hat E_k| \le \frac{1}{10t}.$$



7. (Overall Error): Combining the above computations we get:

$$|\hat{\hat E}_k - E_k| \le \frac{1}{4t}.$$

The overall time needed to obtain $\hat{\hat E}_k$ was $\tilde O(|k| \cdot |t| + |\lambda| + |t|^3)$ and the proof of the theorem is complete.
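Steps 1 and 2 can be sketched in a few lines. The code below assumes the standard Stirling series $\ln k! = k \ln k - k + \frac{1}{2}\ln(2\pi k) + \sum_{j \ge 2} \frac{(-1)^j B_j}{j(j-1) k^{j-1}}$ with the convention $B_1 = -1/2$, computes the Bernoulli numbers exactly as fractions, and truncates at $m_0 = \lceil \log t / \log k \rceil + 1$ (so it needs $k \ge 2$). Unlike the proof, the logarithms and the final sum use ordinary double precision, so this illustrates only the truncation, not the bit-complexity accounting; all names are illustrative.

```python
import math
from fractions import Fraction


def bernoulli_numbers(m):
    """Exact Bernoulli numbers B_0..B_m (B_1 = -1/2) via the standard recurrence."""
    b = [Fraction(1)]
    for n in range(1, m + 1):
        s = sum(math.comb(n + 1, j) * b[j] for j in range(n))
        b.append(-s / (n + 1))
    return b


def stirling_ln_factorial(k, m):
    """Truncated Stirling series for ln(k!), keeping the terms j = 2..m."""
    b = bernoulli_numbers(m)
    series = sum(
        (-1) ** j * b[j] / (j * (j - 1) * Fraction(k) ** (j - 1))
        for j in range(2, m + 1)
    )
    return k * math.log(k) - k + 0.5 * math.log(2 * math.pi * k) + float(series)


def truncation_order(k, t):
    """m_0 = ceil(log t / log k) + 1, as in Step 1 (requires k >= 2)."""
    return math.ceil(math.log(t) / math.log(k)) + 1


def approx_exponent(k, lam, t):
    """Approximate E_k = -lam + k*ln(lam) - ln(k!) using the truncated series."""
    return -lam + k * math.log(lam) - stirling_ln_factorial(k, truncation_order(k, t))
```

For example, with $k = 20$ and $t = 1000$ the truncation order is $m_0 = 4$, and the resulting exponent agrees with the `math.lgamma`-based value to well within $1/(10t)$.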



7 Conclusion and open problems 

While we have essentially settled the sample and time complexity of learning an unknown Poisson Binomial 
Distribution to high accuracy, several natural goals remain for future work. One goal is to obtain a proper learn- 
ing algorithm which is as computationally efficient as our non-proper algorithm. Another goal is to understand 
the sample complexity of learning log-concave distributions over $[n]$ (a distribution $X$ over $[n]$ is log-concave
if $p_i^2 \ge p_{i+1} p_{i-1}$ for every $i$, where $p_j$ denotes $\Pr[X = j]$). Every PBD over $[n]$ is log-concave (see Section 2
of [KG71]), and every log-concave distribution over $[n]$ is unimodal; thus this class lies between the
class of PBDs (now known to be learnable from 0{l/e^) samples) and the class of unimodal distributions (for 
which $\Omega(\log(n)/\epsilon^3)$ samples are necessary). Can log-concave distributions over $[n]$ be learned from $\mathrm{poly}(1/\epsilon)$
samples independent of $n$? If not, what is the dependence of the sample complexity on $n$?
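The containments mentioned above (every PBD is log-concave, every log-concave distribution is unimodal) can be observed numerically on small examples. The sketch below computes a PBD's probability mass function by repeated convolution and tests both properties up to a floating-point tolerance; the function names, the tolerance, and the example expectations are illustrative choices, not part of the paper.

```python
def pbd_pmf(ps):
    """PMF of a sum of independent Bernoulli(p_i) variables, by convolution."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for j, mass in enumerate(pmf):
            nxt[j] += mass * (1 - p)      # the new indicator is 0
            nxt[j + 1] += mass * p        # the new indicator is 1
        pmf = nxt
    return pmf


def is_log_concave(pmf, tol=1e-12):
    """Check p_i^2 >= p_{i+1} * p_{i-1} at every interior point."""
    return all(pmf[i] ** 2 + tol >= pmf[i - 1] * pmf[i + 1]
               for i in range(1, len(pmf) - 1))


def is_unimodal(pmf, tol=1e-12):
    """Check that the PMF is non-decreasing up to a peak and non-increasing after."""
    peak = max(range(len(pmf)), key=pmf.__getitem__)
    rising = all(pmf[i] <= pmf[i + 1] + tol for i in range(peak))
    falling = all(pmf[i] + tol >= pmf[i + 1] for i in range(peak, len(pmf) - 1))
    return rising and falling


example = pbd_pmf([0.1, 0.5, 0.9, 0.3])
```

A distribution such as `[0.5, 0.0, 0.5]` fails both checks, which shows that the tests are not vacuous.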